"A Logic Simulation System"




The present invention relates to a logic simulator and in particular to a method and
apparatus for improving the efficiency of logic simulations.

Logic simulation plays an important role in the design and validation of VLSI circuits.
As circuits increase in size and complexity, there is an ever-increasing demand to
accelerate the processing speed of this design tool. Parallel processing has been
perceived in industry as the best method to achieve this goal and numerous parallel
processing systems have been developed. Unfortunately, large speedup figures have
eluded these approaches. Higher speedup figures have been achieved, but only by
compromising the accuracy of the gate delay model employed in these systems. The
principal barrier is a large communication overhead due to the basic passing of values
between processors, elaborate measures to avoid or recover from deadlock, and load
balancing techniques.

The ever-expanding size of VLSI (Very Large Scale Integration) circuits has further
emphasised the need for a fast and accurate means of simulating digital circuits. A
compromise between model accuracy and computational feasibility is found in
Logic Simulation. In this simulation paradigm, signal values are discrete and may
acquire in the simplest case logic values 0 and 1. More complex transient state
signal values are modelled using up to 9-state logic. Logic gates can be modelled
as ideal components with zero switching time or more realistically as electronic
components with finite delay and switching characteristics such as inertial, pure or
ambiguous delays.






Due to the enormity of the computational effort for large circuits, the application of
parallel processing to this problem has been explored. Unfortunately, large
speedup performance for most systems and approaches has been elusive.



Sequential (uni-processor) logic simulation can be divided into two broad
categories: compiled code and event-driven simulation (Breuer et al: Diagnosis and
Reliable Design of Digital Systems. Computer Science Press, New York (1976)).
These techniques can be employed in a parallel environment by partitioning the
circuit amongst processors. In compiled code simulation, all gates are evaluated at
all time steps, even if they are not active. The circuit has to be levellised and only
unit or zero delay models can be employed. Sequential circuits also pose
difficulties for this type of simulation. A compiled code mechanism has been
applied to several generations of specialised parallel hardware accelerators
designed by IBM: the Logic Simulation Machine LSM (Howard et al: Introduction to
the IBM Los Gatos Simulation Machine. Proc IEEE Int. Conf. Computer Design:
VLSI in Computers. (Oct 1983) 580-583), the Yorktown Simulation Engine
YSE (Pfister: The Yorktown Simulation Engine: Introduction. 19th ACM/IEEE Design
Automation Conf, (June 1982), 51-54) and the Engineering Verification Engine
EVE (Dunn: IBM's Engineering Design System Support for VLSI Design and
Verification. IEEE Design and Test of Computers, (February 1984) 30-40), with
performance figures as high as 2.2 billion gate evaluations/sec reported. Agrawal
et al: Logic Simulation and Parallel Processing. Intl Conf on Computer Aided
Design (1990), have analysed the activity of several circuits and their results have
indicated that at any time instant circuit activity (i.e. gates whose outputs are in
transition) is typically in the range 1% to 0.1%. Therefore, the effective number of
gate evaluations of these engines is likely to be smaller by a factor of a hundred or
more. Speedup values ranging from 6 to 13 for various compiled code
benchmark circuits have been observed on the shared memory MIMD Encore
Multimax multiprocessor by Soule and Blank: Parallel Logic Simulation on General
Purpose Machines. Proc Design Automation Conf, (June 1988), 166-171. A SIMD
(array) version was investigated by Kravitz (Mueller-Thuns et al: Benchmarking
Parallel Processing Platforms: An Application Perspective. IEEE Trans on Parallel
and Distributed Systems, 4 No. 8 (Aug 1993)) with similar results.

The intrinsic unit delay model of compiled code simulators is overly simplistic for
many applications.

Some delay model limitations of compiled code simulation have been eliminated in
parallel event-driven techniques. These parallel algorithms are largely composed of
two phases: a gate evaluation phase and an event-scheduling phase. The gate
evaluation phase identifies gates that are changing and the scheduling phase puts the
gates affected by these changes (the fan-out gates) into a time-ordered linked
schedule list, determined by the current time and the delays of the active gates. Soule
and Blank: Parallel Logic Simulation on General Purpose Machines. Proc Design
Automation Conf, (June 1988), 166-171 and Mueller-Thuns et al: Benchmarking
Parallel Processing Platforms: An Application Perspective. IEEE Trans on Parallel and
Distributed Systems, 4 No 8 (Aug 1993) have investigated both shared and distributed
memory synchronous event MIMD architectures. Again, overall performance has been
disappointing: the results of several benchmarks executed on an 8-processor Encore
Multimax and an 8-processor iPSC-Hypercube only gave speedup values ranging
from 3 to 5.

Asynchronous event simulation permits limited processor autonomy. Causality
constraints require occasional synchronisation between processors and rolling back of
events. Deadlock between processors must be resolved. Chandy and Misra:
Asynchronous Distributed Simulation via a Sequence of Parallel Computations. Comm
ACM 24(ii) (April 1981), 198-206 and Bryant: Simulation of Packet Communications
Architecture Computer Systems. Tech report MIT-LCS-TR-188. MIT Cambridge
(1977) have developed deadlock avoidance algorithms, while Briner: Parallel Mixed
Level Simulation of Digital Circuits Using Virtual Time. Ph.D. thesis. Dept of El. Eng. Duke
University, (1990) and Jefferson: Virtual Time. ACM Trans Programming Languages and
Systems, (July 1985) 404-425 have explored algorithms based on deadlock recovery.
The best speedup performance figures for shared and distributed memory
asynchronous MIMD systems were 8.5 for a 14-processor system and 20 for a
32-processor BBN system.

Optimising strategies such as load balancing, circuit partitioning and distributed
queues are necessary to realise the best speedup figures. Unfortunately, these
mechanisms themselves contribute large communication overhead costs for even
modest sized parallel systems. Furthermore, the gate evaluation process, despite
its small granularity, incurs between 10 and 250 machine cycles per gate evaluation.

The present invention overcomes the problems inherent in logic simulators by
implementing an Associative memory architecture which comprises an Associative
Parallel Processor for Logic Event Simulation, hereinafter referred to in this
specification as APPLES, which is specifically designed for parallel discrete event logic
simulation. APPLES performs gate evaluations in memory and replaces
interprocessor communication with a scan technique. Furthermore, the scan
mechanism is well disposed to parallelisation and a wide range of delay models is
feasible. The invention has been implemented as a Verilog model but is not limited to
such.

Detailed Description of the Invention 

According to the present invention, the essential elemental tasks for parallel logic 
simulation are: 



1. Gate evaluation.
2. Delay model implementation.
3. Updating fan-out gates.

The design framework for a specific parallel logic simulation architecture consists of
identifying the essential elemental simulation operations that can be performed in
parallel and minimising the tasks that support these operations and which are
totally intrinsic to the parallel system.

Activities such as event scheduling and load balancing are perceived as
implementation issues which need not necessarily be incorporated into a new
design. An important additional criterion was that the design must execute directly
in hardware as many parallel tasks as possible, as fast as possible, but without
limiting the type of delay model.

The present invention, taking account of the above objectives, incorporates
several special Associative memory blocks and hardware in the APPLES
architecture.






The Gate evaluation/Delay model implementation and Update/Fan-out process 
will be explained with reference to the APPLES architecture with reference to Fig. 

A gate can be evaluated once its input wire values are known. In conventional
uni-processor and parallel systems these values are stored in memory and accessed by
the processor(s) when the gate is activated. In APPLES, gate signal values are stored
in associative memory words. The succession of signal values that have appeared on
a particular wire over a period of time is stored in a given associative memory word in
a time-ordered sequence. For instance, a binary value model could store in a 32-bit
word the history of wire values that have appeared over the last 32 time intervals.
Gate evaluation proceeds by searching in parallel for appropriate signal values in
associative memory. Portions of the words which are irrelevant (e.g. only the 4 most
recent bits are relevant for a 4-unit gate delay model) are masked out of the search by
the memory's Input and Mask register combination. For a given gate type (e.g. And,
Or) and gate delay model there are requirements on the structure of the input signals
to effect an output change. Each pattern search in associative memory detects those
signal values that have a certain attribute of the necessary structure (e.g. those
signals which have gone high within the last 3 time units). Those wires that have all
the attributes indicate active gates. The wire values are stored in a memory block
designated Associative Array 1b (Word-line-register Bank). Only those gate types
relevant to the applied search patterns are selected. This is accomplished by tagging
a gate type to each word. These tags are held in Associative Array 1a. A specific gate
type is activated by a parallel search of the designated tag in Associative Array 1a.


This simple evaluation mechanism implies that the wires must be identified by the type
of gate into which they flow, since different gate types have different input wire
sequences that activate them. Gates of a certain type are selected by a parallel
search on gate type identifiers in Associative Array 1a.


Each signal attribute corresponds to a bit pattern search in memory. Since several 
attributes are normally required for an activated gate, the result of several pattern 
searches must be recorded. These searches can be considered as Tests on words. 





The result of a test is either successful or not. This can be recorded as a single bit in a
corresponding word in another register held in a register bank termed the
Test-result-register Bank. Since each gate is assumed to have two inputs (inverters and
multiple input gates are translated into their 2-input gate circuit equivalents) tests are
combined on pairs of words in this bank. This combination mechanism is specific to a delay
model, is defined by the Result-activator register and consists of a simple And or Or
operation between bits in the word pairs.

The results of combining each word pair, the final stage of the gate evaluation
process, are stored as a single word in another associative array, the Group-result
register Bank. Active gates will have a unique bit pattern in this bank and can be
identified by a parallel search for this bit pattern. Successful candidates of this search
set their bit in the 1-bit column register Group-test Hit list.

The APPLES gate evaluation mechanism selects gates of a certain type, applies a
sequence of bit pattern searches (Tests) to them and ascertains the active gates by
recording the result of each pattern search and determining those that have fulfilled all
the necessary tests. This mechanism executes gate evaluation in constant time, since the
parallel search is independent of the number of words. This is an effective linear
speedup for the evaluation activity. It also facilitates different delay models, since a delay
model is defined by a set of search patterns.
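
The selection and search steps described above can be illustrated with a minimal behavioural sketch in Python. This is not the Verilog model of this specification: the function name `associative_search`, the two-bit gate-type tags and the 8-bit word width are assumptions made for the example, and the list comprehension stands in for hardware that compares all words simultaneously. A mask bit of 1 marks a bit position that takes part in the comparison.

```python
def associative_search(words, pattern, mask, active=None):
    """Compare every word against pattern under mask.

    active: optional per-word enable bits; a disabled word never hits.
    Returns one hit bit per word.
    """
    if active is None:
        active = [1] * len(words)
    return [1 if en and (w & mask) == (pattern & mask) else 0
            for w, en in zip(words, active)]

AND, OR = 0b01, 0b10            # hypothetical gate-type tags
array_1a = [AND, AND, OR, OR]   # one tag per wire
# Wire histories, leftmost bit = most recent value (binary logic).
array_1b = [0b11100000, 0b01100000, 0b11111111, 0b00011100]

# Step 1: select wires feeding AND gates by a parallel search on the tags.
active = associative_search(array_1a, AND, 0b11)
# Step 2: on those words only, test "high for the last 3 time units".
hits = associative_search(array_1b, 0b11100000, 0b11100000, active)
print(active, hits)  # [1, 1, 0, 0] [1, 0, 0, 0]
```

The third word matches the bit pattern but is not reported, because its tag did not activate it in step 1.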

Active gates set their bits in the column hit list. A Multiple Response Resolver scans
through this list. The resolver can be a single counter which inspects the entire list from
top to bottom, stops when it encounters a set bit and then uses its current value
as a vector for the fan-out list of the identified active gate. This list has the addresses
of the fan-out gate inputs in the Input-value register Bank. The new logic value of the
active gate is written into the appropriate words of this bank.

It then clears the bit before decrementing through the remainder of the list and
repeating this process. All hit bits are ORed together so that when all bits are clear this
can be detected immediately and no further scanning need be done.
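
A sequential resolver of this kind can be sketched in Python as follows. The names `resolve_hits`, `fanout_table` and `input_values` are hypothetical, the fan-out lists are held in a dictionary for brevity, and the counter here increments where the text describes decrementing; the direction does not affect the result.

```python
def resolve_hits(hit_list, fanout_table, input_values, new_value):
    """Scan hit_list once; for each set bit, write new_value to every
    fan-out input address of that gate, then clear the bit.

    Returns the number of locations inspected.
    """
    cycles = 0
    for gate in range(len(hit_list)):
        if not any(hit_list):   # ORed hit bits: stop once all are clear
            break
        cycles += 1             # one inspection per location
        if hit_list[gate]:
            for addr in fanout_table[gate]:
                input_values[addr] = new_value
            hit_list[gate] = 0
    return cycles

hit_list = [0, 1, 0, 1, 0, 0]
fanout_table = {1: [4, 7], 3: [2]}   # gate index -> fan-out input addresses
input_values = [0] * 8
cycles = resolve_hits(hit_list, fanout_table, input_values, new_value=1)
print(cycles, input_values)  # 4 [0, 0, 1, 0, 1, 0, 0, 1]
```

Only four of the six locations are inspected, because the ORed hit bits show the list to be clear after the second hit has been serviced.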



Several Scan registers can be introduced to scan the column hit list in parallel.
Each operates autonomously except when two or more registers simultaneously
detect a hit; a Clash has occurred. Then each scan register must wait until it is
arbitrarily allowed to access and update the fan-out memory. Each register scans an
equal size portion. The frequency of clashes depends on the probability of a hit
for each scan register; typically this probability is between 0.01 and 0.001 for
digital circuits. The timing mechanism in APPLES enables only active gates to be
identified and the multiple scan register structure provides a pipeline of gates to be
updated for the current time interval without an explicit scheduling mechanism.
The scheduler has been substituted by this more efficient parallel scan procedure.
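
To make the notion of a clash concrete, the following Python sketch counts the inspections at which two or more scan registers see a hit at the same time. It is a simplification under stated assumptions: the name `count_clashes` is hypothetical, the hit list length is taken to be a multiple of the number of registers, and the registers are assumed to advance in lockstep, so the stalls caused by waiting are not modelled.

```python
def count_clashes(hit_list, n_registers):
    """Count inspections at which two or more scan registers see a hit.

    Each register scans its own equal-size portion of hit_list, one
    location per step, in lockstep with the others.
    """
    size = len(hit_list) // n_registers
    clashes = 0
    for offset in range(size):
        hits = sum(hit_list[r * size + offset] for r in range(n_registers))
        if hits >= 2:
            clashes += 1
    return clashes

hit_list = [0, 1, 0, 0,   # portion scanned by register 0
            0, 1, 1, 0]   # portion scanned by register 1
print(count_clashes(hit_list, 2))  # 1
print(count_clashes(hit_list, 1))  # 0
```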



When all gate types have been evaluated for the current time interval, all signals are
updated by shifting in parallel the words of the Input-value register bank into the
corresponding words of the Word-line register bank. For 8-valued logic (i.e. 3 bits for
each word in the Input-value register bank) this phase requires 3 machine cycles. The
Input-value register bank can be implemented as a multi-ported memory system which
allows several input values to be updated simultaneously, provided that the values are
located in different memory banks.

The APPLES bit shift mechanism has made the role of a scheduler redundant.
Furthermore, it enables the gate evaluation process to be executed in memory,
thereby avoiding the traditional Von Neumann bottleneck. Each word pair in
Array 1b is effectively a processor. The major issues which cause a large overhead in
other parallel logic simulators are deadlock and scheduling.

Deadlock occurs in the Chandy-Misra algorithm due to two rules required for
temporal correctness, an Input waiting rule and an Output waiting rule. Rule one
is observed by the update mechanism of APPLES. For any time interval T_i to T_{i+1},
all words in Array 1b reflect the state of wires at time T_i and at the end of the
evaluation and update process all wires have been updated to time T_{i+1}. All wires
have been incremented by the smallest timestamp, one discrete time unit. Thus at
the start of every time interval all gates can be evaluated with confidence that the
input values are correct. The Output rule is imposed to ensure that signal values
arrive for processing in non-decreasing timestamp order. This is guaranteed in
APPLES, since all signal values maintain their temporal order in each word
through the shift operation. Unlike the Chandy-Misra algorithm, deadlock is
impossible as every gate can be evaluated at each time interval.

There is no scheduler in the APPLES system. Complex models such as inertial
delays have confronted schedulers with costly (timewise) unscheduling problems.
Gates which have been scheduled to become active need to be de-scheduled
when input signals are found to be shorter than some predefined minimum duration.
This, together with the normal scheduling tasks, contributes an onerous overhead.



Fig. 2 displays the equivalent mechanism in APPLES. An AND gate has two inputs,
a and b. Assume that unless signals are of at least three units duration no effect
occurs at the output, that the simulation involves only binary values 0 and 1 and that
each bit in Array 1b represents one time unit. Signal b is constant at value 1, while signal
a is at logic 1 for two time units, less than the minimum time. This will be detected
by the parallel search generated by the Input and Mask register combination and
the gate will not become active.
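
The inertial delay check in this example reduces to a single masked comparison per input, which the following Python sketch illustrates. The function names, the 8-bit history width and the sample bit patterns are assumptions chosen for the example; the most recent value is held in the leftmost bit.

```python
def held_high(word, units, width):
    """True if the `units` most recent samples (leftmost bits) are all 1."""
    mask = ((1 << units) - 1) << (width - units)
    return (word & mask) == mask

WIDTH = 8       # time units of history per wire (assumed)
MIN_WIDTH = 3   # minimum pulse duration for the AND gate to respond

a_short = 0b11000000   # a has been high for two time units only
a_long = 0b11100000    # a has been high for three time units
b = 0b11111111         # b is constant at 1

def and_gate_active(a, b):
    """Both inputs must have been high for at least MIN_WIDTH units."""
    return held_high(a, MIN_WIDTH, WIDTH) and held_high(b, MIN_WIDTH, WIDTH)

print(and_gate_active(a_short, b))  # False
print(and_gate_active(a_long, b))   # True
```

No event has to be scheduled and later withdrawn: the short pulse simply never matches the search pattern.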

The circuit is now ready to be simulated by APPLES and is parsed to generate the
gate type and delay model and topology information required to initialise
associative arrays 1a, 1b and the fan-out vector tables. There is no limit on the
number of fan-out gates.

Referring again to Fig. 1, the functional blocks of the APPLES processor are
shown. The blocks pertinent to gate evaluation are Associative Array 1a,
Input-value-register Bank, Associative Array 1b, Test-result-register Bank,
Group-result register Bank and the Group-test Hit list. Apart from the associative
arrays, the Group-result register bank has parallel search facilities. Regardless of
the number of words, these structures can be searched in parallel in constant
time. Furthermore, the words in the Input-value-register Bank and Associative
Array 1b can be shifted right in parallel while resident in memory.



The APPLES processor assumes that the circuit to be simulated has been
translated into an equivalent circuit composed solely of 2-input logic gates. Thus,
every gate has two wires leading into it (an inverter has two wires from one
source). These wires are organised as adjacent words in Associative Array 1b
called a Word set. Associative Array 1a contains identifiers for every wire
indicating the type of gate and input into which the wire is connected. The
identifiers are in an associative memory so that when a particular gate evaluation test
is executed, putting the relevant bit patterns into Input-reg1a and Mask-reg1a
specifies the gate type. All wires connected to such gates will be identified by a
parallel search on Associative Array 1a and these will be used to activate the
appropriate words in Associative Array 1b (Word-line register bank). Thus, gate
evaluation tests will only be active on the relevant word sets.


The Input-value register bank contains the current input value for each wire. The
three leftmost bits of every word in Associative Array 1b are shifted from this bank
in parallel when all signal values are being updated by one time unit. During the
update phase of the simulation, fan-out wires of active gates are identified and the
corresponding words in the Input-value register bank amended.

Simulation progresses in discrete time units. For any time interval, each gate type
is evaluated by applying tests on Associative Array 1b and combining and
recording results in the neighbouring register banks. Regardless of the number of
gates to be evaluated, this process occupies between 10 machine cycles for the
simplest and 20 machine cycles for the more complex gate delay models, see Fig.
3. Once the fan-out gate inputs have been amended, all wires are time
incremented through a parallel shift operation of 3 machine cycles duration. In
general, for 2^N-valued logic, N shift operations are required to update all signal
values.

Of the entire simulation cycle, the only task affected by the circuit size is that of
scanning the Hit list. As a circuit grows in size the list and sequential scan time
expand proportionately. Analogous to the conventional communication overhead
problem, the APPLES architecture needs to incorporate a scan mechanism which
can effectively increase the scan rate as the hit list expands. This has been
investigated and implemented as a multiple scan register structure.

The series of signal values that appear on a wire over a period of discrete time
units can be represented as a sequence of numbers. For example, in a binary
system, if a wire has a series of logic values 1, 1, 0 applied to it at times t0, t1 and t2
respectively, where t0 < t1 < t2, the history of signal values on this wire can be
denoted as a bit sequence 011; the further left the bit position, the more recently the
value appeared on the wire.

Different delay models involve signal values over various time intervals. In any
model, signal values stored in a word which are irrelevant are masked out of the
search pattern.



The process of updating the signal values of a particular wire is achieved by
shifting right by one time unit all values and positioning the current value into the
leftmost position. Associative Array 1b can shift right all its words in unison. The
new current values are shifted into Associative Array 1b from the Input-value
register bank.
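
The shift update can be sketched in Python as below. The function name `shift_in` and its parameters are hypothetical; `value_bits` is 1 for binary logic and would be 3 for 8-valued logic. The example replays a wire that receives the values 1, 1, 0 at successive time units and ends with the history 011.

```python
def shift_in(words, current_values, word_width, value_bits):
    """Shift every word right by one time unit (value_bits positions) and
    place the wire's current value in the leftmost value_bits positions."""
    full = (1 << word_width) - 1
    return [((w >> value_bits) | (v << (word_width - value_bits))) & full
            for w, v in zip(words, current_values)]

# One wire, 3-bit history, binary logic.
words = [0b000]
for value in (1, 1, 0):           # values applied at t0, t1, t2
    words = shift_in(words, [value], word_width=3, value_bits=1)
print(format(words[0], "03b"))    # 011
```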

With wire signal values represented as bit sequences in associative memory
words, the task of Gate evaluations can be executed as a sequence of parallel
pattern searches. Figure 4 depicts the scenario where 8-valued logic has been
employed and the AND gate has been arbitrarily modelled as having a 1 unit
delay.

Any gate which has ANY input satisfying T1 and NO (NONE) input satisfying T2 will
transition to 0.

Consequently, to determine if the output of this gate is going to transition from
logic 1 to logic 0 it is necessary to know the signal values at the current time t_c and
at t_{c-1}. The current values are contained in the leftmost three bits of the word set.
Figure 4 declares the current values on the two inputs as logic 1='111' and logic
0='000' and the previous values as both logic 1.



To ascertain if this AND gate has an output transition to logic 0, two simple bit
pattern tests will suffice. If ANY current input value is logic 0 (Test T1) and NONE
of the previous input values are logic 0 (Test T2), then the output will change to
logic 0. These are the only conditions for this delay model which will effect this
transition. With associative memory any portion of a word can be active or passive
in a search. Thus, putting '000' and '111' into the leftmost three bits of the Search
and Mask registers of Associative Array 1b can execute test T1. Test T2 can be
executed by essentially the same test on the next leftmost three bit positions.




In general each test is applied one at a time. The result of test Ti on word j is stored
in the i-th bit position of word j in the Test-result register bank. A '1' indicates a
successful test outcome. For each word set, for every test it is necessary to know if
ANY or BOTH or NONE of the inputs passed the particular test. If the i-th bits of
word j and word j-1 in the Test-result register bank are Ored together and the result
of this operation is '1', then at least one input in the corresponding word set
passed the test Ti (the ANY condition test). If the result of the operation is '0' then
no inputs passed test Ti (the NONE condition test). Finally, if the i-th bits are Anded
together and the result is '1' then BOTH have passed test Ti.





The Result-activator register combines results which are subsequently ascertained
by the Group-result register. The logical interaction is shown in Fig. 5.

The And or Or operation between the bit positions is dictated by the Result
Activator register. A '0' in the i-th bit position of the Result Activator register
performs an Or action on the results of test Ti for each word set in the Test-result
register bank and conversely a '1' an And action. Each i-th And or Or operation is
enacted in parallel through all word set Test result register pairs.


The results of the activity of the Result Activator register on each word set Test
result register pair are saved in an associated Group Result register. Apart from
retaining the results for a particular word set, the Group Result registers are
composite elements in an associative array. This facilitates a parallel search for a
particular result pattern and thus identifies all active gates. These gates are
identified as Hits (of the search in the Group result register bank) in the Group-test
Hit List.







Returning to the AND gate transition to logic '0' example, an AND gate will be
identified as fulfilling the test requisites, ANY input passing test T1 and NONE
passing test T2, if its corresponding Group Result register has the bit sequence '10'
in the first two bit positions.
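
The combination of test results for this AND gate example can be sketched in Python as follows. The names `combine` and `group_search`, and the use of lists of bits ordered [T1, T2], are assumptions made for readability; in APPLES these operations act on register bits in parallel across all word sets.

```python
def combine(word_a, word_b, activator):
    """Per test i: OR the two inputs' result bits if activator[i] is 0,
    AND them if activator[i] is 1."""
    return [(a & b) if act else (a | b)
            for a, b, act in zip(word_a, word_b, activator)]

def group_search(group_results, pattern):
    """Parallel search of the Group Result registers for one pattern."""
    return [1 if g == pattern else 0 for g in group_results]

# Test-result bits [T1, T2] for the two input wires of each gate.
# T1: current value is logic 0.  T2: previous value is logic 0.
test_results = [
    ([0, 0], [1, 0]),   # gate 0: one input has just fallen to 0
    ([1, 1], [0, 0]),   # gate 1: one input was already 0
    ([0, 0], [0, 0]),   # gate 2: both inputs remain at 1
]
activator = [0, 0]      # Or action for both tests
group = [combine(a, b, activator) for a, b in test_results]
hit_list = group_search(group, [1, 0])   # ANY passed T1, NONE passed T2
print(group, hit_list)  # [[1, 0], [1, 1], [0, 0]] [1, 0, 0]
```

Only gate 0 produces the sequence '10', so only its bit is set in the hit list.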


The APPLES components involved in the Gate evaluation phase and their
sequencing are shown in Fig. 6.

Complex delay models such as Inertial delays require conventional sequential
and parallel logic simulators to Unschedule events when some timing criterion is
violated. This entails an extremely time consuming search through an event
list. In the present invention, inertial delays only require verification that signals
are at least some minimum time width, which is implementable as a single pattern search.

An Ambiguous delay is more complicated where the statistical behaviour of a
gate conveys an uncertainty in the output. A gate output acquires an unknown
value between some parameters t_min (M time units) and t_max (N time units). Using
4-valued logic, APPLES detects an initial output change to the unknown value at
time t_min, followed by the transition from unknown value to logic state '0' at time
t_max, see Fig. 7. Hazard conditions, where both inputs simultaneously switch to
converse values, can also be detected, as illustrated in Fig. 7.

For each gate type, the evaluation time $T_{gate-eval}$ remains constant, typically ranging
from 10 to 20 machine cycles. The time to scan the Hit list depends on its length
and the number of registers employed in the scan. N scan registers can divide a
Hit list of H locations into N equal partitions of size H/N. Assuming a location can
be scanned in 1 machine cycle, the scan time, $T_{scan}$, is H/N cycles. Likewise it will
be assumed that 1 cycle will be sufficient to make 1 fan-out update.

For one scan register partition, the number of updates is $(\mathrm{Prob}_{hit})H/N$. If all N
partitions update without interference from other partitions this also represents the
total update time for the entire system. However, while one fan-out is being
updated, other registers continue to scan and hits in these partitions may have to
wait and queue. The probability of this happening increases with the number of
partitions and is given by ${}^{N}C_{1}(\mathrm{Prob}_{hit})H/N$.





A clash occurs when two or more registers simultaneously detect a hit and
attempt to access the single-ported fan-out memory. In these circumstances, a
semaphore arbitrarily authorises waiting registers' access to memory. The
number of clashes during a scan is,

$$\text{No. Clashes} = (\text{Prob of 2 hits per inspection}) \times H/N + \text{Higher order probabilities} \qquad (1)$$

The low activity rate of circuits (typically 1%-5% of the total gate count) implies that
higher order probabilities can be ignored. Assume a uniform random distribution of
hits and let $\mathrm{Prob}_{hit}$ be the probability that the register will encounter a hit on an
inspection. Then (1) becomes,

$$\text{No. Clashes} = {}^{N}C_{2}\,(\mathrm{Prob}_{hit})^{2} \times H/N \qquad (2)$$

Thus, $T_N$, the average total time required to scan and update the fan-out lists of a
partition for a particular gate type is,

$$T_N = T_{gate-eval} + T_{scan} + T_{update} + T_{clash}$$

$$T_N = T_{gate-eval} + H/N + {}^{N}C_{1}(\mathrm{Prob}_{hit})H/N + {}^{N}C_{2}(\mathrm{Prob}_{hit})^{2} \times H/N \qquad (3)$$

Since all partitions are scanned in parallel, $T_N$ also corresponds to the processing
time for an N scan register system. Thus, the speedup $S_p = T_1/T_N$ of such a
system is,

$$\frac{T_1}{T_N} = \frac{T_{gate-eval} + T_{scan} + T_{update}}{T_{gate-eval} + H/N + {}^{N}C_{1}(\mathrm{Prob}_{hit})H/N + {}^{N}C_{2}(\mathrm{Prob}_{hit})^{2} \times H/N} \qquad (4)$$
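
Equations (3) and (4) can be evaluated numerically with a short Python sketch. The function names and the sample figures (a Hit list of 4392 locations and a hit probability of 0.01, with the gate evaluation time set to zero) are assumptions chosen for illustration, not measurements from this specification.

```python
from math import comb

def scan_time(n, hit_list_len, p_hit, t_gate_eval=0.0):
    """Eq. (3): average cycles per gate type with n scan registers."""
    h = hit_list_len
    t_scan = h / n
    t_update = comb(n, 1) * p_hit * h / n
    t_clash = comb(n, 2) * p_hit ** 2 * h / n
    return t_gate_eval + t_scan + t_update + t_clash

def speedup(n, hit_list_len, p_hit, t_gate_eval=0.0):
    """Eq. (4): time with one register divided by time with n registers."""
    return (scan_time(1, hit_list_len, p_hit, t_gate_eval)
            / scan_time(n, hit_list_len, p_hit, t_gate_eval))

H, P_HIT = 4392, 0.01   # assumed sample values
# Brute-force search for the register count that minimises Eq. (3).
best_n = min(range(1, 401), key=lambda n: scan_time(n, H, P_HIT))
print(best_n, round(speedup(best_n, H, P_HIT), 1))
```

With one register the clash term vanishes, since no pair of registers exists, so `scan_time(1, ...)` reduces to the numerator of (4).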


Eqt (4) has been validated empirically. Predicted results are within 20% of
observed for sample circuits C7552 and C2670 and 30% for C1908. Non-uniformity
of hit distribution appears to be the cause for this deviation.

Differentiating $T_N$ w.r.t. N and ignoring 2nd order and higher powers of $\mathrm{Prob}_{hit}$, the
optimum number of scan registers $N_{optimum}$ and corresponding optimum speedup
$S_{optimum}$ are given by,

$$N_{optimum} = \sqrt{2}/\mathrm{Prob}_{hit} \qquad (5)$$

$$S_{optimum} = 1/(2.4 \times \mathrm{Prob}_{hit}) \qquad (6)$$
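
The intermediate steps behind (5) and (6) can be written out as follows. This is a sketch of one way to reach the stated results from (3), under the added assumption that $T_{gate-eval}$ is negligible compared with H when the speedup is formed; the factor 2.4 then appears as $1+\sqrt{2} \approx 2.41$.

```latex
% Expand the binomial coefficients in (3): C(N,1)/N = 1 and C(N,2)/N = (N-1)/2.
T_N = T_{gate-eval} + \frac{H}{N} + \mathrm{Prob}_{hit}\,H
      + \frac{N-1}{2}\,\mathrm{Prob}_{hit}^{2}\,H

% Set the derivative with respect to N to zero.
\frac{dT_N}{dN} = -\frac{H}{N^{2}} + \frac{\mathrm{Prob}_{hit}^{2}\,H}{2} = 0
\quad\Longrightarrow\quad
N_{optimum} = \frac{\sqrt{2}}{\mathrm{Prob}_{hit}}

% Substitute N_{optimum} back into T_N.
T_{N_{optimum}} = T_{gate-eval} + (1+\sqrt{2})\,\mathrm{Prob}_{hit}\,H
                  - \tfrac{1}{2}\,\mathrm{Prob}_{hit}^{2}\,H

% With T_1 = T_{gate-eval} + H + \mathrm{Prob}_{hit} H, neglect T_{gate-eval}
% and keep only the leading power of \mathrm{Prob}_{hit}.
S_{optimum} = \frac{T_1}{T_{N_{optimum}}}
  \approx \frac{H}{(1+\sqrt{2})\,\mathrm{Prob}_{hit}\,H}
  = \frac{1}{2.41\,\mathrm{Prob}_{hit}}
```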



Thus, the optimum number of scan registers is inversely proportional to the
probability of a hit being encountered in the Hit list. In APPLES, the important
processing metric is the rate at which gates can be evaluated and their fan-out lists
updated. As the probability of a hit increases there will be a corresponding increase in
the rate at which gates are updated. Circuits under simulation which happen to
exhibit higher hit rates will have a higher update rate.

When the average fan-out time is not one cycle, $\mathrm{Prob}_{hit}$ is multiplied by Fout, where
Fout is the effective average fan-out time.

A higher hit rate can also be accomplished through the introduction of extra
registers. An increase in registers increases the hit rate and the number of
clashes. The increase halts when the hit rate equals the fan-out update rate; this
occurs at $N_{optimum}$. This situation is analogous to a saturated pipeline. Further
increases in the number of registers serve only to increase the number of clashes
and waiting lists of those registers attempting to update fan-out lists.



In this embodiment of the present invention, the APPLES processor was
implemented, validated and evaluated, again using a Verilog model. However, it is
appreciated that the APPLES system can be implemented in any other simulation
language, for example, VHDL. The following are some specific examples of
simulations carried out by the APPLES system. ISCAS-85 benchmarks C880 (622
gates), C1908 (786 gates), C2670 (1736 gates) and C7552 (4392 gates) were
simulated to generate statistics and performance figures. The gate counts refer to
expanded circuits. The statistics were compiled from 10 trials per
circuit, each trial being a distinct input vector exercised over 10000 to 10000
machine cycles for the smallest to the largest circuit respectively.

The average number of cycles, taking into account the scan and fan-out cycle
times, executed per gate processed for the 3 benchmark circuits is shown in
Figure 8. The fixed size overheads such as gate evaluation and time
incrementation have been excluded from this analysis, since in a smaller circuit
they form a proportionately larger overhead. Excluding these overheads is
representative of large circuits where the fixed overheads are insignificant compared
to the scan and update times.



Naturally, as more registers are employed, the average cycle time per gate
processed reduces. In the first column of Figure 8, depicting cycle times for one
scan register, the variation in performance is attributable to the distribution of hits
in the Hit-list. As more registers are introduced, the number of cycles per gate is
progressively dominated by the number of cycles required to update the fan-out
lists; the scan time becomes less significant.

Figure 9 illustrates the speedup performance for the circuits. Again, the adjusted
figures give a more balanced analysis. As the fan-out updates converge to the
fan-out memory bandwidth, maximum speedup is attained.

For comparison purposes, Figure 10 uses data from Banerjee, Parallel Algorithms
for VLSI Computer-Aided Design (Prentice-Hall, 1994), which illustrates the
speedup performance on various parallel architectures for circuits of similar size to
those used in this paper. This indicates that APPLES consistently offers higher
speedup.



The following from pages 16 to 42 is an example of an implementation of the 
present invention in software written in Verilog. 






Verilog Description of APPLES 




Associative Array1a

Description: Each word of this array holds a bit sequence identifying the gate type and input
connection of a wire in the corresponding position in Associative Array1b. The input/mask
register combination defines a gate type that will be activated for searching in Associative
Array1a. Words that successfully match are indicated in a 1-bit column register. The array also
has write capabilities.

module Ary_1a (Input_reg1a, Mask_reg1a, Adr_reg1a, Clock,
               Search_enbl1a, Write_enbl1a, Activ_lst1a);

// Input_reg1a, Mask_reg1a, Adr_reg1a are the Input, Mask and Address registers
// of Associative Array1a.
// When Search_enbl1a is set, the negative edge of Clock initiates a parallel
// search.
// Activ_lst1a is a column register that indicates those words in Associative
// Array1a which compared successfully with the search pattern.

parameter Ary_1a_wdth=7;
parameter Ary1a_size=16383;
integer Ary_index;

input Clock, Search_enbl1a, Write_enbl1a;
input [Ary_1a_wdth:0] Input_reg1a, Mask_reg1a, Adr_reg1a;

output [Ary1a_size:0] Activ_lst1a;
reg [Ary1a_size:0] Activ_lst1a;

reg [Ary_1a_wdth:0] Ary1a_ass_mem[0:Ary1a_size], Temp_reg;

initial
begin
  $readmemb("Ary1a.dat", Ary1a_ass_mem);
  // Ary1a.dat is the data file defining the gate and model types in the circuit.
  for (Ary_index=0; Ary_index<=Ary1a_size; Ary_index=Ary_index+1)
  begin
    Activ_lst1a[Ary_index]=0;
  end
end

always @(negedge Clock)
begin
  if (Search_enbl1a)
  begin
    for (Ary_index=0; Ary_index<=Ary1a_size; Ary_index=Ary_index+1)
    begin
      Temp_reg=Ary1a_ass_mem[Ary_index];
      if ((~Mask_reg1a | (Input_reg1a & Temp_reg) |
           (~Input_reg1a & ~Temp_reg)) == 8'hff)
        Activ_lst1a[Ary_index]=1;
      else
        Activ_lst1a[Ary_index]=0;
    end
  end
  if (Write_enbl1a) Ary1a_ass_mem[Adr_reg1a] = Input_reg1a;
end

endmodule
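
The search condition used in Associative Array1a can be read as "every bit selected by the mask must agree between the search pattern and the stored word". The following is a minimal sketch of that test on a single 8-bit word, written for illustration only. The module name masked_match_demo, the function names listing_form and xor_form, and the sample values are assumptions and do not appear in the listing above. The sketch also compares the expression against an equivalent exclusive-OR form over a number of random values.

```verilog
// Illustrative sketch only; all names here are hypothetical.
module masked_match_demo;
  // Expression as written in the listing: the result is all ones
  // when every bit selected by the mask agrees.
  function listing_form;
    input [7:0] mask, pattern, word;
    begin
      listing_form = ((~mask | (pattern & word) | (~pattern & ~word)) == 8'hff);
    end
  endfunction

  // Equivalent form: no selected bit may differ.
  function xor_form;
    input [7:0] mask, pattern, word;
    begin
      xor_form = (((pattern ^ word) & mask) == 8'h00);
    end
  endfunction

  integer k;
  reg [7:0] m, p, w;
  reg mismatch;

  initial begin
    mismatch = 0;
    for (k = 0; k < 1000; k = k + 1) begin
      m = $random; p = $random; w = $random;
      if (listing_form(m, p, w) !== xor_form(m, p, w)) mismatch = 1;
    end
    // With mask 8'h0f only the low nibble takes part in the comparison.
    $display("low nibble equal   -> %b", listing_form(8'h0f, 8'ha5, 8'h35));
    $display("low nibble differs -> %b", listing_form(8'h0f, 8'ha5, 8'h36));
    $display("forms disagree     -> %b", mismatch);
  end
endmodule
```

A mask bit of 0 therefore acts as a "don't care" for that bit position, which is how one search pattern can select a whole class of gate types.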

Associative Array1b

Description: Every word in this array holds the signal values on a specific wire, the most
recent values being leftmost in each word. All words can be simultaneously shifted right,
effecting a one unit time increment on all wires. The signal values are updated from a 1-bit
column register. The array has parallel search and read and write capabilities.

module Ary_1b (Search_reg1b, Mask_reg1b, Adr_reg1b, Datain_reg1b,
               Dataout_reg1b, Hit_buffr_reg1b, Shft_enbl, Search_enbl1b,
               Write_enbl, Read_enbl, Clock, Input_bit,
               Word_line_enbl);

// Search_reg1b, Mask_reg1b, Adr_reg1b, Datain_reg1b, Dataout_reg1b are the
// Search, Mask, Address, Data-in and Data-out registers of Associative Array1b.
// When Search_enbl1b is set, the negative edge of Clock initiates a parallel
// search. Likewise, a read or write operation is executed on the negative edge of
// the clock if Write_enbl or Read_enbl is asserted.
//
// A parallel search is initiated on a negative edge of the Clock if Search_enbl1b is
// set. This search is only active on those words that are primed for searching by
// the Word_line_enbl column register. The bits in this register are set/cleared by
// Activ_lst1a of Associative Array1a. This effectively selects gates of a certain
// gate type and delay model. Words that match are identified by a bit being set in the
// corresponding position in Hit_buffr_reg1b.
//
// Words are shifted right in parallel with the leftmost bit being taken from
// Input_bit.

parameter Ary1b_mem_size=16383;
parameter Wlr_wrdsize=31;
parameter Shft_dly=2;
parameter Adr_reg_bits=13;

input [Wlr_wrdsize:0] Search_reg1b, Mask_reg1b, Datain_reg1b;
input [Ary1b_mem_size:0] Input_bit, Word_line_enbl;
input Clock;
input Shft_enbl, Search_enbl1b, Write_enbl, Read_enbl;

reg [Wlr_wrdsize:0] Temp_reg1;
reg [Wlr_wrdsize:0] Wlr_Ass_mem[0:Ary1b_mem_size];
input [Adr_reg_bits:0] Adr_reg1b;

output [Ary1b_mem_size:0] Hit_buffr_reg1b;
reg [Ary1b_mem_size:0] Hit_buffr_reg1b;

output [Wlr_wrdsize:0] Dataout_reg1b;
reg [Wlr_wrdsize:0] Dataout_reg1b;
integer Mem_indx;

initial $readmemb("Array1b.dat", Wlr_Ass_mem);
// Array1b.dat is the file which initialises all the words in Array1b to the
// Unknown value.

always @(negedge Clock)
begin
  if (Shft_enbl)
  begin
    for (Mem_indx=0; Mem_indx<=Ary1b_mem_size; Mem_indx=Mem_indx+1)
    begin
      Temp_reg1 = Wlr_Ass_mem[Mem_indx];
      Temp_reg1 = Temp_reg1 >> 1;
      Temp_reg1[Wlr_wrdsize] = Input_bit[Mem_indx];
      Wlr_Ass_mem[Mem_indx] = Temp_reg1;
    end
  end
  else
  if (Search_enbl1b)
  begin
    for (Mem_indx=0; Mem_indx<=Ary1b_mem_size; Mem_indx=Mem_indx+1)
    begin
      if (Word_line_enbl[Mem_indx])
      begin
        Temp_reg1 = Wlr_Ass_mem[Mem_indx];
        if ((~Mask_reg1b | (Search_reg1b & Temp_reg1) |
             (~Search_reg1b & ~Temp_reg1)) == 32'hffffffff)
        begin
          Hit_buffr_reg1b[Mem_indx] = 1;
        end
        else
        begin
          Hit_buffr_reg1b[Mem_indx] = 0;
        end
      end
      else
        Hit_buffr_reg1b[Mem_indx] = 0;
    end
  end
  else
  if (Write_enbl)
    Wlr_Ass_mem[Adr_reg1b] = Datain_reg1b;
  else
  if (Read_enbl)
    Dataout_reg1b = Wlr_Ass_mem[Adr_reg1b];
end
endmodule
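
The time increment performed by Associative Array1b amounts to shifting each word right by one place and writing the newest signal value into the leftmost bit. The following minimal sketch applies those same steps to a single word. The module name shift_history_demo, the task advance_time and the 8-bit word width are assumptions chosen for brevity; the listing above uses 32-bit words.

```verilog
// Illustrative sketch only; names and the 8-bit width are hypothetical.
module shift_history_demo;
  parameter W = 7;          // index of the leftmost bit
  reg [W:0] history;        // one word: newest value at the left

  // One unit time increment on a single wire.
  task advance_time;
    input new_bit;
    begin
      history = history >> 1;   // age every stored value by one time unit
      history[W] = new_bit;     // the newest value enters at the left
    end
  endtask

  initial begin
    history = 8'b0000_0000;
    advance_time(1'b1);
    advance_time(1'b0);
    advance_time(1'b1);
    $display("history=%b", history);
  end
endmodule
```

After the three calls the word reads 10100000, so a search pattern applied to the leftmost bits inspects the most recent transitions on the wire.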






Test-result register Bank

Description: When a search is executed on Associative Array1b, if a word in Array1b matches
the search pattern then the corresponding bit in the Test-result register bank will be set,
otherwise it is cleared. The Result-activator register specifies the logical combination between
pairs of words (a gate's set of inputs). The result of this combination of word pairs is a column
register (half the length of the number of word pairs).

module Tst_rslt_reg_bank (Inp_buffr_reg, Trr_wrt_enbl, Comb_enbl, Clock,
                          Out_buffr_reg, Rslt_act_reg, Write_pos, Rset);

// Inp_buffr_reg is a column of bits describing the outcome of a search on each
// word in Array1b. This bit column is written into a column of the Test-result
// register bank on the negative edge of Clock when Trr_wrt_enbl is asserted. The
// position of this column is defined by Write_pos.
//
// Word pairs are combined according to the bit sequence in Rslt_act_reg. A '0' in
// bit i of Rslt_act_reg ORs the i bits in each word pair and produces the result for
// each pair in Out_buffr_reg. This combination is executed on the negative edge of
// Clock when Comb_enbl is asserted. Rset resets all the bits in the Test-result
// register bank.

parameter Trr_word_size=7;
parameter Trr_mem_size=16383;
parameter Trr_out_size=8191;
parameter Trr_wdth_spec=2;

reg [Trr_word_size:0] Trr_array[0:Trr_mem_size];
reg [Trr_word_size:0] Temp_reg1, Temp_reg2;
reg Rslt_action;

input [Trr_mem_size:0] Inp_buffr_reg;
input [Trr_word_size:0] Rslt_act_reg;
input [Trr_wdth_spec:0] Write_pos;
input Clock;
input Trr_wrt_enbl;
input Comb_enbl;
input Rset;

output [Trr_out_size:0] Out_buffr_reg;
reg [Trr_out_size:0] Out_buffr_reg;

integer Bank_index, i;

always @(negedge Clock)
begin
  if (Trr_wrt_enbl)
  begin
    for (Bank_index=0; Bank_index<=Trr_mem_size; Bank_index=Bank_index+1)
    begin
      Temp_reg1=Trr_array[Bank_index];
      Temp_reg1[Write_pos]=Inp_buffr_reg[Bank_index];
      Trr_array[Bank_index]=Temp_reg1;
    end
  end
  else
  if (Comb_enbl)
  begin
    Rslt_action=Rslt_act_reg[Write_pos];
    begin
      for (Bank_index=0; Bank_index<Trr_mem_size; Bank_index=Bank_index+2)
      begin
        Temp_reg1=Trr_array[Bank_index];
        Temp_reg2=Trr_array[Bank_index+1];
        if (Rslt_action==0)
          Out_buffr_reg[Bank_index/2]=(Temp_reg1[Write_pos] |
                                       Temp_reg2[Write_pos]);
        else
          Out_buffr_reg[Bank_index/2]=Temp_reg1[Write_pos] &
                                      Temp_reg2[Write_pos];
      end
    end
  end
  else
  if (Rset)
  begin
    for (Bank_index=0; Bank_index<=Trr_mem_size; Bank_index=Bank_index+1)
      Trr_array[Bank_index]=8'h00;
  end
end

endmodule
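
The combination step of the Test-result register bank reduces one bit column to half its length by ORing or ANDing the bits of adjacent words. The sketch below illustrates this on a column of eight bits. The module name pair_combine_demo, the task combine, the signal names and the 8-entry size are assumptions made for illustration; only the OR/AND selection mirrors the role of Rslt_action in the listing above.

```verilog
// Illustrative sketch only; names and sizes are hypothetical.
module pair_combine_demo;
  reg [7:0] column;       // one bit column of the bank, one bit per word
  reg [3:0] pair_result;  // one result bit per word pair
  integer n;

  // use_and = 0 selects OR, use_and = 1 selects AND.
  task combine;
    input use_and;
    begin
      for (n = 0; n < 8; n = n + 2)
        if (use_and == 0)
          pair_result[n/2] = column[n] | column[n+1];
        else
          pair_result[n/2] = column[n] & column[n+1];
    end
  endtask

  initial begin
    column = 8'b1101_0010;
    combine(0);
    $display("OR  of pairs = %b", pair_result);
    combine(1);
    $display("AND of pairs = %b", pair_result);
  end
endmodule
```

For a two-input gate, each pair of adjacent words corresponds to the gate's two input wires, so the result column holds one bit per gate.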



Group-result register Bank



Description: The result of the combination of word pairs in the Test-result register is written as
a column of bits into the Group-result register bank. When all combination results have been
generated a parallel search is executed on the bank to ascertain all word pairs in
Array1b that passed all the test pattern searches.

module Grp_rslt_reg_bank (Grr_inp_reg, Grr_mask_reg, Grr_srch_reg,
                          Clock, Srch_enbl, Wrt_enbl, Write_pos,
                          Grr_hit_list);

// Grr_inp_reg is shifted as a bit column into a column of the Group-result
// register bank defined by Write_pos. This column write operation is activated on
// the negative edge of Clock when Wrt_enbl is asserted.
//
// Grr_mask_reg and Grr_srch_reg compose a search pattern enacted on the negative
// edge of Clock when Srch_enbl is set. Pattern matches are indicated in
// Grr_hit_list. The Grr_hit_list is also known as the Group-test Hit list.

parameter Grr_mem_size=8191;
parameter Grr_word_size=7;
parameter Grr_wdth_spec=2;

input [Grr_mem_size:0] Grr_inp_reg;
input [Grr_word_size:0] Grr_mask_reg, Grr_srch_reg;
input [Grr_wdth_spec:0] Write_pos;
input Clock, Srch_enbl, Wrt_enbl;

output [Grr_mem_size:0] Grr_hit_list;
reg [Grr_mem_size:0] Grr_hit_list;

reg [Grr_word_size:0] Grr_array[0:Grr_mem_size];
reg [Grr_word_size:0] Temp_reg;

integer Bank_index;

always @(negedge Clock)
  if (Wrt_enbl)
  begin
    for (Bank_index=0; Bank_index<=Grr_mem_size; Bank_index=Bank_index+1)
    begin
      Temp_reg=Grr_array[Bank_index];
      Temp_reg[Write_pos]=Grr_inp_reg[Bank_index];
      Grr_array[Bank_index]=Temp_reg;
    end
  end
  else if (Srch_enbl)
    for (Bank_index=0; Bank_index<=Grr_mem_size; Bank_index=Bank_index+1)
    begin
      Temp_reg=Grr_array[Bank_index];
      if ((~Grr_mask_reg | (Grr_srch_reg & Temp_reg) |
           (~Grr_srch_reg & ~Temp_reg)) == 8'hff)
        Grr_hit_list[Bank_index]=1;
      else
        Grr_hit_list[Bank_index]=0;
    end

endmodule

Multiple-response resolver (Version 1.0 Single Scan mode)

Description: The Multiple-response resolver scans the Group-test Hit list (a 1-bit column
register). The resolver commences a scan by initialising its counter with the top address of the
Hit list. This counter serves as an address register which facilitates reading of every Hit list bit.
If the inspected bit is set, the fan-out list of the associated gate is accessed and updated
appropriately. The bit is then reset. After reset, or if the bit was already zero, the counter is
decremented to point to the next address in the Hit list. The inspection process is repeated. The
scanning terminates either when all bits have been inspected or all bits are zero.

module Multiple_res_res (Grr_hit_list, Clock,
                         Reset_ctr, End_scan_flag, Decrmt_enbl,
                         Fan_out_src_reg, Fan_out_size_reg, Rset_hit_fnd_flg,
                         Hit_fnd_flag);

// The Multiple-response resolver inspects a new bit of Grr_hit_list on the
// negative edge of Clock while Decrmt_enbl is asserted. Reset_ctr loads the
// resolver's counter with top location of Hit list. If the current inspected bit is
// set, Hit_fnd_flag is asserted and the vector and the size (no. of gates) for the
// fan-out list loaded into Fan_out_src_reg and Fan_out_size_reg, respectively.
// Scanning halts and only recommences on the positive edge of Rset_hit_fnd_flg which
// is externally controlled. Scanning terminates when all bits have been inspected or
// reset to zero. This condition is indicated by End_scan_flag.

parameter Grr_mem_size=8191;
parameter Vectr_tbl_adr_reg_bits=13;
parameter Fanout_hdr_tbl_wdth=13;
parameter Max_fan_out=7;
parameter Inp_bnk_size=16383;

input Reset_ctr, Rset_hit_fnd_flg, Clock;
input [Grr_mem_size:0] Grr_hit_list;
input Decrmt_enbl;

output End_scan_flag;
reg End_scan_flag;

output Hit_fnd_flag;
reg Hit_fnd_flag;

// The output ranges are stated so that port and reg declarations agree.
output [Vectr_tbl_adr_reg_bits:0] Fan_out_src_reg;
reg [Vectr_tbl_adr_reg_bits:0] Fan_out_src_reg;

output [Max_fan_out:0] Fan_out_size_reg;
reg [Max_fan_out:0] Fan_out_size_reg;

reg [Fanout_hdr_tbl_wdth:0] Fan_out_hdr_tbl[0:Inp_bnk_size];
reg [Vectr_tbl_adr_reg_bits:0] Hit_lst_ctr;
reg [Max_fan_out:0] Fan_out_size_tbl[0:Inp_bnk_size];
reg [Grr_mem_size:0] Hit_lst_buffr;
reg Hit_fnd_ORed_flg, Tst_or_bit;

integer Num_hits, Hit_dist, Sum_hit_dist, Prev_hit_lst_ctr, Avg_dist;

initial $readmemh("Fansize.dat", Fan_out_size_tbl);
// The file Fansize.dat specifies the size of the fan-out list for each gate being
// simulated.

initial forever
begin
  @(Reset_ctr)
  if (Reset_ctr)
  begin
    Num_hits=0;
    Prev_hit_lst_ctr=Grr_mem_size;
    Sum_hit_dist=0;
    Hit_lst_buffr=Grr_hit_list;
    Tst_or_bit=|Grr_hit_list;
    $display("OR Check=%b", Tst_or_bit);
    Hit_lst_ctr=Grr_mem_size;
    End_scan_flag=0;
    Hit_fnd_flag=0;
    Hit_fnd_ORed_flg=1;
    $display("Initialisation seq executed");
  end
end

always @(negedge Clock)
begin
  if ((Decrmt_enbl) && (!End_scan_flag))
  begin
    Hit_fnd_ORed_flg=|Hit_lst_buffr;
    if ((Hit_lst_ctr>0) && (Hit_fnd_ORed_flg))
    begin
      if (Hit_lst_buffr[Hit_lst_ctr]==1)
      begin
        Num_hits=Num_hits+1;
        Hit_dist=Prev_hit_lst_ctr-Hit_lst_ctr;
        Sum_hit_dist=Hit_dist+Sum_hit_dist;
        $display("Hit distance=%d", Hit_dist, "Time=%d", $time);
        Prev_hit_lst_ctr=Hit_lst_ctr;
        Fan_out_size_reg=Fan_out_size_tbl[Hit_lst_ctr];
        Fan_out_src_reg=Fan_out_hdr_tbl[Hit_lst_ctr];
        Hit_fnd_flag=1;
        Hit_lst_buffr[Hit_lst_ctr]=0;
      end
    end

    if ((Hit_lst_ctr>0) && (!Hit_fnd_ORed_flg))
    begin
      End_scan_flag=1;
      $display("No of hits in fan-out list=%d", Num_hits);
      Avg_dist=Sum_hit_dist/Num_hits;
      $display("Average hit distance=%d", Avg_dist);
    end

    if (Hit_lst_ctr==0)
    begin
      if (Hit_lst_buffr[Hit_lst_ctr]==1)
      begin
        Num_hits=Num_hits+1;
        Hit_dist=Prev_hit_lst_ctr-Hit_lst_ctr;
        $display("Hit distance=%d", Hit_dist);
        Prev_hit_lst_ctr=Hit_lst_ctr;
        Sum_hit_dist=Hit_dist+Sum_hit_dist;
        Fan_out_size_reg=Fan_out_size_tbl[Hit_lst_ctr];
        Fan_out_src_reg=Fan_out_hdr_tbl[Hit_lst_ctr];
        Hit_fnd_flag=1;
      end
      End_scan_flag=1;
      $display("No of hits in fan-out list=%d", Num_hits);
      Avg_dist=Sum_hit_dist/Num_hits;
      $display("Average hit distance=%d", Avg_dist);
    end

    Hit_lst_ctr=Hit_lst_ctr-1;
  end
end

always @(posedge Rset_hit_fnd_flg)
begin
  Hit_fnd_flag=0;
end

endmodule
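
The single scan described above can be pictured as a counter walking down a short hit list, servicing and clearing each set bit, and stopping early once no set bits remain. The following is a minimal sketch of that behaviour on an 8-bit list, without the clocking and handshake signals of the resolver itself. The module name scan_demo, the variables ctr and hits, and the sample hit list are assumptions introduced for illustration.

```verilog
// Illustrative sketch only; names and the 8-bit list are hypothetical.
module scan_demo;
  reg [7:0] hit_list;
  integer ctr, hits;

  initial begin
    hit_list = 8'b0100_1001;   // hits at addresses 6, 3 and 0
    hits = 0;
    // Scan from the top address downwards; the OR-reduction ends the
    // scan as soon as every remaining bit is zero.
    for (ctr = 7; ctr >= 0 && (|hit_list); ctr = ctr - 1)
      if (hit_list[ctr]) begin
        $display("hit at address %0d", ctr);
        hit_list[ctr] = 0;     // bit is reset once its fan-out is handled
        hits = hits + 1;
      end
    $display("hits=%0d", hits);
  end
endmodule
```

The number of addresses stepped over between consecutive hits corresponds to the hit distance that the listing above accumulates in Sum_hit_dist.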
Multiple-response resolver (Version 2.0 Multiple Scan Mode)

Description: The Multiple-response resolver scans the Group-test Hit list (a 1-bit column
register). The resolver has multiple scan registers. Each is
assigned an equal size portion of the Group-test Hit list. When the resolver is initialised all scan
registers point to the top of their respective Hit list segment. The registers are synchronised by a
single clock. The external functionality of the Multiple Scan Mode resolver is identical to that of
the Single Scan Mode version. Internally, the Multiple Scan version uses a Wait semaphore to
queue multiple accesses to the fan-out lists. Registers which clash are queued arbitrarily and
only recommence scanning after gaining permission to update their fan-out lists. Scanning
terminates when all bits have been inspected or all bits are zero.

module Multiple_res_res (Grr_hit_list, Clk,
                         Reset_ctr, End_scan_flag, Decrmt_enbl,
                         Fan_out_src_reg, Fan_out_size_reg, Rset_hit_fnd_flg,
                         Hit_fnd_flag);

// The Multiple-response resolver inspects in parallel several bits of
// Grr_hit_list on the negative edge of Clock while Decrmt_enbl is asserted.
// Reset_ctr loads the resolver's scan registers with the top location of each
// respective segment of the Hit list. If any of the inspected bits are set,
// Hit_fnd_flag is asserted. The vector and the size (no. of gates) for the fan-out
// list of the segment which has been granted permission is loaded into
// Fan_out_src_reg and Fan_out_size_reg, respectively. Scanning halts for all
// registers awaiting permission and recommences on
// the positive edge of Rset_hit_fnd_flg which is externally controlled. For
// registers that have not found a hit, a new bit is inspected on the negative edge
// of Clock. Scanning terminates when all bits have been inspected or reset to zero.
// This condition is indicated by End_scan_flag.

parameter Grr_mem_size=8191;
parameter Vectr_tbl_adr_reg_bits=13;
parameter Fanout_hdr_tbl_wdth=13;
parameter Max_fan_out=7;
parameter Inp_bnk_size=16383;

input Reset_ctr, Rset_hit_fnd_flg, Clk;
input [Grr_mem_size:0] Grr_hit_list;
input Decrmt_enbl;

output End_scan_flag;
reg End_scan_flag;

output Hit_fnd_flag;
reg Hit_fnd_flag;

// The output ranges are stated so that port and reg declarations agree.
output [Vectr_tbl_adr_reg_bits:0] Fan_out_src_reg;
reg [Vectr_tbl_adr_reg_bits:0] Fan_out_src_reg;

output [Max_fan_out:0] Fan_out_size_reg;
reg [Max_fan_out:0] Fan_out_size_reg;

reg [Fanout_hdr_tbl_wdth:0] Fan_out_hdr_tbl[0:Inp_bnk_size];
reg [Max_fan_out:0] Fan_out_size_tbl[0:Inp_bnk_size];
reg [Grr_mem_size:0] Hit_lst_buffr;
reg Hit_fnd_ORed_flg, Tst_or_bit, Mpl_scan_enbl;

integer Num_hits, Num_hits_ratio, Start_time, Finish_time;

reg decrmt_enbl1, decrmt_enbl2, decrmt_enbl3, decrmt_enbl4, mem_access;
reg decrmt_enbl5, decrmt_enbl6, decrmt_enbl7, decrmt_enbl8;
// ...
reg decrmt_enbl25, decrmt_enbl26, decrmt_enbl27, decrmt_enbl28;
reg decrmt_enbl29, decrmt_enbl30;
// These registers enable a segment to be scanned when asserted. This program
// assumes that the list is divided into 30 equal sized segments.

integer c1, c2, c3, c4, c5, c6, c7, c8;
// ...
integer c25, c26, c27, c28, c29, c30, Total;

reg [Vectr_tbl_adr_reg_bits:0] pos1, pos2, pos3, pos4, pos5, pos6, pos7, pos8;
// ...
reg [Vectr_tbl_adr_reg_bits:0] pos25, pos26, pos27, pos28, pos29, pos30;
// These are the scan registers for each segment.

parameter upr_lt1=149;
parameter lwr_lt1=0;
parameter upr_lt2=299;
parameter lwr_lt2=150;
parameter upr_lt3=449;
parameter lwr_lt3=300;
parameter upr_lt4=599;
parameter lwr_lt4=450;
parameter upr_lt5=749;
parameter lwr_lt5=600;
parameter upr_lt6=899;
parameter lwr_lt6=750;
// ...
parameter upr_lt27=4049;
parameter lwr_lt27=3900;
parameter upr_lt28=4199;
parameter lwr_lt28=4050;
parameter upr_lt29=4349;
parameter lwr_lt29=4200;
parameter upr_lt30=4392;
parameter lwr_lt30=4350;
// These parameters define the upper and lower limits of the segments of the
// Group-test Hit list.

initial
begin
  pos1=upr_lt1;
  pos2=upr_lt2;
  pos3=upr_lt3;
  pos4=upr_lt4;
  pos5=upr_lt5;
  pos6=upr_lt6;
  // ...
  pos27=upr_lt27;
  pos28=upr_lt28;
  pos29=upr_lt29;
  pos30=upr_lt30;

  decrmt_enbl1=1;
  decrmt_enbl2=1;
  decrmt_enbl3=1;
  decrmt_enbl4=1;
  decrmt_enbl5=1;
  decrmt_enbl6=1;
  decrmt_enbl7=1;
  // ...
  decrmt_enbl28=1;
  decrmt_enbl29=1;
  decrmt_enbl30=1;

  c1=0;
  c2=0;
  c3=0;
  c4=0;
  c5=0;
  c6=0;
  // ...
  c27=0;
  c28=0;
  c29=0;
  c30=0;

  mem_access=1;
end

initial $readmemh("Fanout.dat", Fan_out_hdr_tbl);
// The file Fanout.dat contains the vectors for the start of the fan-out lists for
// every gate in the circuit being simulated.

initial $readmemh("Fansize.dat", Fan_out_size_tbl);
// The file Fansize.dat specifies the size of the fan-out list for each gate being
// simulated.

initial forever
begin
  @(Reset_ctr)
  if (Reset_ctr)
  begin
    Num_hits=0;
    Hit_lst_buffr=Grr_hit_list;
    Tst_or_bit=|Grr_hit_list;
    $display("OR Check=%b", Tst_or_bit);
    End_scan_flag=0;
    Hit_fnd_flag=0;
    Hit_fnd_ORed_flg=1;
    pos1=upr_lt1;
    pos2=upr_lt2;
    pos3=upr_lt3;
    pos4=upr_lt4;
    pos5=upr_lt5;
    pos6=upr_lt6;
    // ...
    pos27=upr_lt27;
    pos28=upr_lt28;
    pos29=upr_lt29;
    pos30=upr_lt30;

    decrmt_enbl1=1;
    decrmt_enbl2=1;
    decrmt_enbl3=1;
    decrmt_enbl4=1;
    decrmt_enbl5=1;
    decrmt_enbl6=1;
    // ...
    decrmt_enbl27=1;
    decrmt_enbl28=1;
    decrmt_enbl29=1;
    decrmt_enbl30=1;

    c1=0;
    c2=0;
    c3=0;
    c4=0;
    c5=0;
    c6=0;
    // ...
    c27=0;
    c28=0;
    c29=0;
    c30=0;

    mem_access=1;
    $display("Initialisation seq executed");
    Start_time=$time;
  end
end

always @(posedge Decrmt_enbl)
begin
  Mpl_scan_enbl=1;
end

always @(posedge Rset_hit_fnd_flg)
begin
  Hit_fnd_flag=0;
  mem_access=1;
end

always @(negedge Clk)
begin
  if (!End_scan_flag)
  begin
    Hit_fnd_ORed_flg=|Hit_lst_buffr;
    if (!Hit_fnd_ORed_flg)
    begin
      End_scan_flag=1;
      Mpl_scan_enbl=0;
    end
  end

  if ((Mpl_scan_enbl) && (Hit_fnd_ORed_flg))
  begin
    if (decrmt_enbl1)
    begin
      if (Hit_lst_buffr[pos1]==1)
      begin
        Hit_lst_buffr[pos1]=0;
        decrmt_enbl1=0;
        if (!mem_access)
        begin
          c1=c1+1;
          $display("Clash1 c1=%d", c1);
        end
        wait(mem_access);
        mem_access=0;
        Num_hits=Num_hits+1;
        Fan_out_size_reg=Fan_out_size_tbl[pos1];
        Fan_out_src_reg=Fan_out_hdr_tbl[pos1];
        Hit_fnd_flag=1;
        Hit_lst_buffr[pos1]=0;
        if (pos1>lwr_lt1)
        begin
          pos1=pos1-1;
          decrmt_enbl1=1;
        end
      end
      else
      begin
        if (pos1>lwr_lt1)
        begin
          pos1=pos1-1;
        end
        else
          decrmt_enbl1=0;
      end
    end

    // ...

    if (decrmt_enbl30)
    begin
      if (Hit_lst_buffr[pos30]==1)
      begin
        Hit_lst_buffr[pos30]=0;
        decrmt_enbl30=0;
        if (!mem_access)
        begin
          c30=c30+1;
          $display("Clash30 c30=%d", c30);
        end
        wait(mem_access);
        mem_access=0;
        Num_hits=Num_hits+1;
        Fan_out_size_reg=Fan_out_size_tbl[pos30];
        Fan_out_src_reg=Fan_out_hdr_tbl[pos30];
        Hit_fnd_flag=1;
        Hit_lst_buffr[pos30]=0;
        if (pos30>lwr_lt30)
        begin
          pos30=pos30-1;
          decrmt_enbl30=1;
        end
      end
      else
      begin
        if (pos30>lwr_lt30)
        begin
          pos30=pos30-1;
        end
        else
          decrmt_enbl30=0;
      end
    end
  end
end

always @(posedge End_scan_flag)
begin
  Finish_time=$time;
end

endmodule
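
The segment limits that survive in the listing above (for example 0 to 149 for segment 1, 3900 to 4049 for segment 27 and 4350 to 4392 for segment 30) are consistent with segments of 150 entries, with the last segment cut short at entry 4392. The sketch below regenerates limits under that assumption. The module name segment_limits_demo and the parameters SEG_LEN and LAST are hypothetical, and the limits of the segments not shown in the listing are inferred by this rule rather than taken from the source.

```verilog
// Illustrative sketch only; names are hypothetical and the rule is inferred.
module segment_limits_demo;
  parameter SEG_LEN = 150;    // segment length implied by the listed limits
  parameter LAST    = 4392;   // upper limit of segment 30 in the listing
  integer k, upr, lwr;

  initial
    for (k = 1; k <= 30; k = k + 1) begin
      lwr = SEG_LEN * (k - 1);
      upr = SEG_LEN * k - 1;
      if (upr > LAST) upr = LAST;   // final segment is shorter
      $display("segment %0d: lwr=%0d upr=%0d", k, lwr, upr);
    end
endmodule
```

Each scan register then starts at the upper limit of its own segment and decrements towards the lower limit.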



Fan-out Generator module

Description: When a hit has been detected in the Group-test Hit list, the address within the scan
register selects a vector (from the Fan-out hdr table) which locates the start of a fan-out list for
the current active gate. The address register of this module is loaded with the address of the
header of the fan-out list. The size of this fan-out list and the updated signal value to be
transmitted are also conveyed to the module. The module proceeds to effect all changes in the
fan-out lists.

module Fan_out_gen (Fan_out_load, Fan_out_gen_flg, Reset_gen, Update_val_in,
                    Clock, Update_val_out, Fan_out_size_reg,
                    Fan_out_adr_reg, Out_adr_reg);

// The address in Fan_out_vector_tbl of the header of the Fan-out list and the
// number of fan-out elements are contained in Fan_out_adr_reg and Fan_out_size_reg
// respectively. These are loaded on the positive edge of Fan_out_load. On the
// successive negative edge(s) of Clock the address of a fan-out wire is generated in
// Out_adr_reg. The end of a fan-out list is indicated when Fan_out_gen_flg is set.
// This flag is cleared by the positive edge of Reset_gen. The signal value to be
// conveyed to the fan-out list is transferred to and transmitted by the module in
// Update_val_in and Update_val_out, respectively.

parameter Vectr_tbl_wrd_size=13;
parameter Vectr_tbl_size=16383;
parameter Inp_val_wdth=2;
parameter Max_fan_out=7;
parameter Vectr_tbl_adr_size=13;

input Fan_out_load, Reset_gen, Clock;
input [Inp_val_wdth:0] Update_val_in;
input [Max_fan_out:0] Fan_out_size_reg;
input [Vectr_tbl_adr_size:0] Fan_out_adr_reg;

output Fan_out_gen_flg;
reg Fan_out_gen_flg;

output [Inp_val_wdth:0] Update_val_out;
reg [Inp_val_wdth:0] Update_val_out;

output [Vectr_tbl_wrd_size:0] Out_adr_reg;
reg [Vectr_tbl_wrd_size:0] Out_adr_reg;

reg [Vectr_tbl_wrd_size:0] Fan_out_vector_tbl[0:Vectr_tbl_size];
reg [Vectr_tbl_wrd_size:0] List_pos;
reg [Max_fan_out:0] Counter;

initial $readmemh("Fanvcr.dat", Fan_out_vector_tbl);
// Fanvcr.dat contains the vectors of the signals in the fan-out lists for every
// gate.

initial forever
begin
  @(Reset_gen)
  if (Reset_gen)
  begin
    Fan_out_gen_flg=0;
  end
end

always @(posedge Fan_out_load)
begin
  if (!Reset_gen)
  begin
    Counter=Fan_out_size_reg;
    List_pos=Fan_out_adr_reg;
    Update_val_out=Update_val_in;
    Fan_out_gen_flg=1;
  end
end

always @(negedge Clock)
begin
  if (!Reset_gen && Fan_out_gen_flg)
  begin
    if (Counter>0)
    begin
      Out_adr_reg=Fan_out_vector_tbl[List_pos];
      List_pos=List_pos+1;
      Counter=Counter-1;
    end
    else
      Fan_out_gen_flg=0;
  end
end
endmodule
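
The fan-out generator walks a contiguous run of the vector table, starting at the header address and stopping after the stated number of entries. The following minimal sketch shows that walk on a small table, without the clock and flag handshake of the module above. The module name fan_out_walk_demo, the 8-entry table and its contents are assumptions made purely for illustration.

```verilog
// Illustrative sketch only; names and table contents are hypothetical.
module fan_out_walk_demo;
  reg [3:0] vector_tbl [0:7];   // small stand-in for the fan-out vector table
  integer list_pos, counter;

  initial begin
    // First gate drives wires 5 and 9; second gate drives wires 2, 3 and 12.
    vector_tbl[0] = 5;  vector_tbl[1] = 9;
    vector_tbl[2] = 2;  vector_tbl[3] = 3;  vector_tbl[4] = 12;

    list_pos = 2;   // header address of the second gate's fan-out list
    counter  = 3;   // size of that fan-out list
    while (counter > 0) begin
      $display("update wire %0d", vector_tbl[list_pos]);
      list_pos = list_pos + 1;
      counter  = counter - 1;
    end
  end
endmodule
```

Storing only a header address and a size per gate keeps the per-gate tables small, while the shared vector table holds every fan-out entry once.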



Input-value Bank

Description: The bank contains the current values of all the signals in the circuit. Each location
in the bank corresponds to a wire. Since a word at any location is 3 bits wide, up to 8-valued
logic can be simulated (this can be augmented by increasing the word width). The current value
of any wire is shifted from this bank into Array1b when time is incremented. This is done in
parallel. Only wire values that have changed in the current time interval are updated.

module Input_val_bank (Inp_val_reg, Adr_reg, Clock, Shft_enbl, Wrt_enbl,
                       Out_buffr_reg);

// Inp_val_reg contains the new value of a signal (i.e. word) in Inp_val_ary. The
// location of the wire is specified in Adr_reg and the write operation takes effect
// on the negative edge of Clock if Wrt_enbl is asserted. If Shft_enbl is asserted
// then the right-most bit of every location is shifted into the 1-bit column
// register Out_buffr_reg on the positive edge of Clock. All shifted bits are also
// written into the left-most bit of Inp_val_ary (i.e. a rotation); thus all current
// values have been retained after the shifting out process.

parameter Inp_val_wdth=2;
parameter Adr_reg_bits=13;
parameter Inp_bnk_size=16383;
parameter Lsr7552_Inp_bnk_size=8784;

input Clock, Shft_enbl, Wrt_enbl;
input [Inp_val_wdth:0] Inp_val_reg;
input [Adr_reg_bits:0] Adr_reg;

output [Inp_bnk_size:0] Out_buffr_reg;
reg [Inp_bnk_size:0] Out_buffr_reg;

reg [Inp_val_wdth:0] Inp_val_ary[0:Inp_bnk_size];
reg [Inp_val_wdth:0] Temp_reg;
reg Temp_bit;

integer Inp_ary_indx, i;

initial $readmemb("Inpval.dat", Inp_val_ary);
// Inpval.dat is the file which initialises the current input values of all gates
// in the simulated circuit. All values are assigned 'Unknown' logic values except
// those primary inputs which are assigned logic '0' or '1'.

always @(posedge Clock)
begin
  if (Shft_enbl)
  begin
    for (Inp_ary_indx=0; Inp_ary_indx<=Lsr7552_Inp_bnk_size;
         Inp_ary_indx=Inp_ary_indx+1)
    begin
      Temp_reg=Inp_val_ary[Inp_ary_indx];
      Temp_bit=Temp_reg[0];
      Out_buffr_reg[Inp_ary_indx]=Temp_bit;
      Temp_reg[1:0]=Temp_reg[Inp_val_wdth:1];
      Temp_reg[Inp_val_wdth]=Temp_bit;
      Inp_val_ary[Inp_ary_indx]=Temp_reg;
    end
    $display("(shft) time=%d", $time);
  end
  else
  if (Wrt_enbl)
  begin
    Inp_val_ary[Adr_reg]=Inp_val_reg;
  end
end
endmodule

The Sequence Logic of the APPLES Processor 

parameter Nibl=3; 
puxmngtau Aiy.la.wdth " 7 , 



parameter Ary^lbJadrjr^gmwdteii^lB ; 
parameter Ary_lai.si r 2'e^r6i3 8 3 ; 
parameter ftry.lb_si' k ze==C63 83 ; 
parameter Eval _ptrri^t bl^sd»ze=63 ; 
parameter Bval^tcn^etr^t^l_sd«»ze=3 1 ; 
parameter Num_tst.wdth=7 ; 
parameter Num.tst_ptrn.tbl_size=31; 
parameter Gate.maskla.tbl.size=31 ; 
parameter Gate.inptla.tbl.size=31 ; 
parameter Trr_ptrn_tbl.size=3 1 ; 
- parameter Gr r_pt rn.tbl_s i z e- 3 1 ; 
parameter Ou t _va l_tbl_s i z e = 3 1 ; 

to 

lS'# uneter Wlr.wr ds i z e = 3 1 ; 
^Bfaineter Trr_wdth_spec=2 ; 

parameter Trr.word_size=7 ; 

parameter Gr r_ntem_s i z e= 8 1 9 1 ; 

parameter Grr_wdth^spee=2 ; 

paramefer Grr_worii^sr z e = 7 ; 

parameter Iu_wordwjsize=7 ; 

parameter Iu_wdth.spee=2 ; 

parameter Vectr_,tl^^adr^reg=13 ; 

parameter Max.f aniout=7 ; 

parameter Inp.valt_wdt h=2 ; 

parameter Vectr_t^^adr^.size=163 83 ; 



parameter Index_reg.wdth=7 ; 

parameter Num.tst_seq=12 ; //No of gates X No Transitions 
rameter Num_tst_ v cnt^Wdth=3 ; 
ameter Iriit_shf t.val=3 ; 
ameter Shf t_ cnt^wdth=3 ; 



wire Clock; 

wire[Ary.la.size:0] Wrd.ln^activ.ls t Trr.bnk.inp.reg ; 

wire[Ary.lb.size:0] Inval_unit_out_reg; 

wire [Grr_mem_size : Q] Grr.bnk.inp.reg, Grr.bnk_hit.lst ; 

wire [Max.f an_out:0] Mrr_uni t_fan.out_size.reg ; 

wire [ Vec tr_tbl.adr.reg : 0 ] Mrr_unit_£an_out_src_reg ; 

wire { Inp_val_wdth : 0 ] Fo _g en_un i t _va 1 _ou t ; 

wire [ Vectr_tbl.adr.size : 0 ] Fo_gen_unit_out_adr_reg ; 

reg Tst.seq.strt; 

reg eO , el , e2 , e3 , e4 , e5 , e6 , e7 , e8 , e9 , elO , ell ,612,613,614, 

el5,el6,el6a,el6b,el7 / el8,el9,e20,e21,e22,e23,e24,e25,e26,e27, e2 8,e29, 
Deact_srchla , Gate_eval.ini £_pr-oc ; . 

reg [ Index.reg_.wdth : 0 ] Ep t.i.,*Epvt.i v-Ntpt_i^<3mla t.i , Gi lat.i , 

Tpt.i , Gr i t.i , Grmt_i , Ovt.i ; 

reg [Wlr_wrdsize:0] Eval_ptrn_tbl [0:Eval_ptrn_tbl_size];

reg [Wlr_wrdsize:0] Eval_ptrn_vctr_tbl [0:Eval_ptrn_vctr_tbl_size];

reg [Num_tst_wdth:0] Num_tst_ptrn_tbl [0:Num_tst_ptrn_tbl_size];

reg [Ary_la_wdth:0] Gate_maskla_tbl [0:Gate_maskla_tbl_size];

reg [Ary_la_wdth:0] Gate_inptla_tbl [0:Gate_inptla_tbl_size];

reg [Trr_word_size:0] Trr_ptrn_tbl [0:Trr_ptrn_tbl_size];

reg [Grr_word_size:0] Grr_inpt_tbl [0:Grr_ptrn_tbl_size];

reg [Grr_word_size:0] Grr_mask_tbl [0:Grr_ptrn_tbl_size];

reg [Inp_val_wdth:0] Out_val_tbl [0:Out_val_tbl_size];

reg [Grr_word_size:0] Grr_bnk_search_reg, Grr_bnk_mask_reg;
reg [Grr_wdth_spec:0] Grr_bnk_wrt_pos;
reg [Trr_wdth_spec:0] Trr_bnk_wrt_pos;

reg [Trr_word_size:0] Trr_rslt_act_reg, Trr_rslt_act_and_0;
reg [Trr_word_size:0] Inval_unit_adr_reg;

reg [Iu_wdth_spec:0] Fo_gen_unit_val_in, Inval_unit_in_reg;

reg Search_ary_la, Write_enbl_la, Ary_lb_wrt_enbl, Wlr_bnk_search_enbl, Shft_ary_lb,
    Ary_lb_rd_enbl, Trr_bnk_wrt_enbl, Trr_bnk_comb_enbl, Trr_bnk_rset,
    Grr_bnk_search_enbl, Grr_bnk_wrt_enbl, Mrr_unit_rset, Mrr_unit_decrmt_enbl,
    Mrr_unit_rset_hit_fnd_flg, Fo_gen_unit_load, Fo_gen_unit_rset,
    Inval_unit_shft_enbl, Inval_unit_wrt_enbl;

reg [Ary_la_wdth:0] Inp_regla, Mask_regla, Adr_regla;

reg [Wlr_wrdsize:0] Inp_reg_lb, Search_reg_lb, Mask_reg_lb;

reg [Ary_lb_adr_reg_wdth:0] Adr_reg_lb;

reg [Num_tst_cnt_wdth:0] Num_tst_cnt;

reg [Shft_cnt_wdth:0] Shft_cnt;




Ary_la Gate_id_bnk (Inp_regla, Mask_regla, Adr_regla, Clock,
                    Search_ary_la, Write_enbl_la, Wrd_ln_activ_lst);

Ary_lb Wrd_ln_reg_bnk (Search_reg_lb, Mask_reg_lb, Adr_reg_lb,
                       Inp_reg_lb, Out_reg_lb, Trr_bnk_inp_reg, Shft_ary_lb,
                       Wlr_bnk_search_enbl, Ary_lb_wrt_enbl, Ary_lb_rd_enbl,
                       Clock, Inval_unit_out_reg, Wrd_ln_activ_lst);

Tst_rslt_reg_bank Trr_bnk (Trr_bnk_inp_reg, Trr_bnk_wrt_enbl, Trr_bnk_comb_enbl,
                           Clock, Grr_bnk_inp_reg, Trr_rslt_act_reg,
                           Trr_bnk_wrt_pos, Trr_bnk_rset);

>_rslt_reg_bank Grr_bnk (Grr_bnk_inp_reg, Grr_bnk_mask_reg,
                         Grr_bnk_search_reg, Clock, Grr_bnk_search_enbl,
                         Grr_bnk_wrt_enbl, Grr_bnk_wrt_pos, Grr_bnk_hit_lst);

Multiple_res_res Mrr_unit (Grr_bnk_hit_lst, Clock, Mrr_unit_rset,
                           Mrr_unit_end_scan_flg, Mrr_unit_decrmt_enbl,
                           Mrr_unit_fan_out_src_reg,
                           Mrr_unit_fan_out_size_reg,
                           Mrr_unit_rset_hit_fnd_flg,
                           Mrr_unit_hit_fnd_flag);

Fan_out_gen Fo_gen_unit (Fo_gen_unit_load, Fo_gen_unit_flg, Fo_gen_unit_rset,
                         Fo_gen_unit_val_in, Clock, Fo_gen_unit_val_out,
                         Mrr_unit_fan_out_size_reg, Mrr_unit_fan_out_src_reg,
                         Fo_gen_unit_out_adr_reg);

Input_val_bank Inval_unit (Fo_gen_unit_val_out, Fo_gen_unit_out_adr_reg, Clock,
                           Inval_unit_shft_enbl, Inval_unit_wrt_enbl,
                           Inval_unit_out_reg);

Ck_gen Clk_unit (Clock);

integer i, Tst_num, iter_cnt;



initial
begin
    $readmemb("Ep_tbl.dat", Eval_ptrn_tbl);
    $display("Ep_tbl.dat loaded.");
    $readmemh("Epv_tbl.dat", Eval_ptrn_vctr_tbl);
    $display("Epv_tbl.dat loaded.");
    $readmemh("Ntp_tbl.dat", Num_tst_ptrn_tbl);
    $display("Ntp_tbl.dat loaded.");
    $readmemb("Gila_tbl.dat", Gate_inptla_tbl);
    $display("Gila_tbl.dat loaded.");
    $readmemb("Gmla_tbl.dat", Gate_maskla_tbl);
    $display("Gmla_tbl.dat loaded.");
    $readmemb("Tp_tbl.dat", Trr_ptrn_tbl);
    $display("Tp_tbl.dat loaded.");
    $readmemb("Gi_tbl.dat", Grr_inpt_tbl);
    $display("Gi_tbl.dat loaded.");
    $readmemb("Gm_tbl.dat", Grr_mask_tbl);
    $display("Gm_tbl.dat loaded.");
    $readmemb("Ov_tbl.dat", Out_val_tbl);
    $display("Ov_tbl.dat loaded.");

    $display("Table initialisation sequence completed");

    Gate_eval_init_proc=1;
    iter_cnt=0;
    Num_tst_cnt=Num_tst_seq;
    Inval_unit_shft_enbl=0;

    Ept_i=8'h00; Epvt_i=8'h00; Ntpt_i=8'h00;
    Gmlat_i=8'h00; Gilat_i=8'h00; Tpt_i=8'h00;
    Grit_i=8'h00; Grmt_i=8'h00; Ovt_i=8'h00;
end

always @(negedge Clock)
if (Gate_eval_init_proc)
begin
    $display("Gate_eval_init_proc at time=%d", $time);
    iter_cnt=iter_cnt+1;
    $display("Iteration count=%d", iter_cnt);
    Gate_eval_init_proc=0;
    Deact_srchla=0;

    e0=0; e1=0; e2=0; e3=0; e4=0; e5=0; e6=0;
    e7=0; e8=0; e9=0; e10=0; e11=0; e12=0; e13=0;
    e14=0; e15=0; e16=0; e16a=0; e16b=0; e17=0;
    e18=0; e19=0; e20=0; e21=0; e22=0;

    Inp_regla=Gate_inptla_tbl[Gilat_i];
    Mask_regla=Gate_maskla_tbl[Gmlat_i];
    Tst_num=Num_tst_ptrn_tbl[Ntpt_i];
    Ept_i=Eval_ptrn_vctr_tbl[Epvt_i];
    Mrr_unit_decrmt_enbl=0;
    Tst_seq_strt=1;
    Wlr_bnk_search_enbl=0;
    Inval_unit_wrt_enbl=0;
end

always @(posedge Clock)
begin
    if (Tst_seq_strt)
    begin
        Trr_bnk_rset=1;
        Search_ary_la=1;
        e0=1;
        Tst_seq_strt=0;
    end
end

always @(negedge Clock)
begin
    if (e0)
    begin
        Deact_srchla=1;
    end
end



always @(posedge Clock)
begin
    if (Deact_srchla)
    begin
        Trr_bnk_rset=0;
        Deact_srchla=0;
        Search_ary_la=0;
        e1=1;
        i=Trr_word_size;
    end
end

always @(negedge Clock)
begin
    if (e1)
    begin
        e1=0;
        e2=1;
    end
end



always @(posedge Clock)
begin
    if (e2)
    begin
        Wlr_bnk_search_enbl=1;
        Search_reg_lb=Eval_ptrn_tbl[Ept_i];
        Mask_reg_lb=Eval_ptrn_tbl[Ept_i+1];
        e2=0;
        e3=1;
    end
end

always @(negedge Clock)
begin
    if (e3)
    begin
        e3=0;
        e4=1;
    end
end

always @(posedge Clock)
begin
    if (e4)
    begin
        Trr_bnk_wrt_enbl=1;
        Trr_bnk_wrt_pos=i;
        Wlr_bnk_search_enbl=0;
        e4=0;
        e5=1;
    end
end



always @(negedge Clock)
begin
    if (e5)
    begin
        e5=0;
        e6=1;
    end
end

always @(posedge Clock)
begin
    if (e6)
    begin
        Tst_num=Tst_num-1;
        i=i-1;
        e6=0;

        if (Tst_num>0)
        begin
            e1=1;
            Ept_i=Ept_i+2;
            $display("Ept_i (updated)=%d", Ept_i);
            Trr_bnk_wrt_enbl=0;
        end
        else
        begin
            Trr_bnk_wrt_enbl=0;
            i=Trr_word_size;
            Trr_rslt_act_reg=Trr_ptrn_tbl[Tpt_i];
            Tst_num=Num_tst_ptrn_tbl[Ntpt_i];
            e7=1;
        end
    end
end



always @(negedge Clock)
begin
    if (e7)
    begin
        e7=0;
        e8=1;
    end
end

always @(posedge Clock)
begin
    if (e8)
    begin
        Trr_bnk_comb_enbl=1;
        Trr_bnk_wrt_pos=i;
        e8=0;
        e9=1;
        $display("Commencement of TRR tests for Gate type=%b", Inp_regla, " at time=%d", $time);
    end
end

always @(negedge Clock)
begin
    if (e9)
    begin
        e9=0;
        e10=1;
    end
end

always @(posedge Clock)
begin
    if (e10)
    begin
        Trr_bnk_comb_enbl=0;
        Grr_bnk_wrt_enbl=1;
        Grr_bnk_wrt_pos=i;
        e10=0;
        e11=1;
    end
end

always @(negedge Clock)
begin
    if (e11)
    begin
        e11=0;
        e12=1;
    end
end



always @(posedge Clock)
begin
    if (e12)
    begin
        Tst_num=Tst_num-1;
        i=i-1;
        e12=0;

        if (Tst_num>0)
        begin
            e9=1;
            Trr_bnk_comb_enbl=1;
            Trr_bnk_wrt_pos=i;
            Grr_bnk_wrt_enbl=0;
        end
        else
        begin
            e13=1;
            Grr_bnk_wrt_enbl=0;
        end
    end
end

always @(negedge Clock)
begin
    if (e13)
    begin
        e13=0;
        e14=1;
        $display("Termination of Trr tests for Gate type=%b", Inp_regla, " at time=%d", $time);
    end
end




always @(posedge Clock)
begin
    if (e14)
    begin
        Grr_bnk_search_reg=Grr_inpt_tbl[Grit_i];
        Grr_bnk_mask_reg=Grr_mask_tbl[Grmt_i];
        Grr_bnk_search_enbl=1;
        Fo_gen_unit_rset=1;
        e14=0;
        e15=1;
    end
end

always @(negedge Clock)
begin
    if (e15)
    begin
        e15=0;
        e16=1;
    end
end



always @(posedge Clock)
begin
    if (e16)
    begin
        Mrr_unit_rset=1;
        e16=0;
        e16a=1;
    end
end

always @(negedge Clock)
begin
    if (e16a)
    begin
        Mrr_unit_rset=0;
        e16a=0;
        e16b=1;
    end
end

// Propagate values to gates affected in fan-out lists.

always @(posedge Clock)
begin
    if (e16b)
    begin
        Grr_bnk_search_enbl=0;
        Mrr_unit_decrmt_enbl=1;
        Fo_gen_unit_rset=0;
        Fo_gen_unit_val_in=Out_val_tbl[Ovt_i];
        e16b=0;
        e17=1;
        $display("Start of fanout list at time=%d", $time);
    end
end




always @(negedge Clock)
begin
    if (e17)
    begin
        Fo_gen_unit_load=0;
        e17=0;
        e18=1;
    end
end

always @(posedge Clock)
begin
    if (e18)
    begin
        if (Mrr_unit_hit_fnd_flag)
        begin
            Fo_gen_unit_load=1;
            e18=0;
            e19=1;
        end
        else
        if ((!Mrr_unit_hit_fnd_flag) & (Mrr_unit_end_scan_flg))
        begin
            e18=0;
            e22=1;
            Mrr_unit_decrmt_enbl=0;
        end
    end
end

always @(negedge Clock)
begin
    if (e19)
    begin
        Fo_gen_unit_load=0;
        Inval_unit_wrt_enbl=1;
        Mrr_unit_rset_hit_fnd_flg=0;
        e19=0;
        e20=1;
    end
end




always @(posedge Clock)
begin
    if (e20)
    begin
        if (!Fo_gen_unit_flg)
        begin
            if (!Mrr_unit_end_scan_flg)
            begin
                Mrr_unit_rset_hit_fnd_flg=1;
                Inval_unit_wrt_enbl=0;
                e20=0;
                e21=1;
            end
            else
            begin
                Inval_unit_wrt_enbl=0;
                e20=0;
                e22=1;
            end
        end
    end
end

always @(negedge Clock)
begin
    if (e21)
    begin
        e18=1;
        e21=0;
    end
end

always @(negedge Clock)
begin
    if (e22)
    begin
        e22=0;
        e23=1;

        Epvt_i=Epvt_i+1; Ntpt_i=Ntpt_i+1;
        Gmlat_i=Gmlat_i+1; Gilat_i=Gilat_i+1;
        Tpt_i=Tpt_i+1;
        Grit_i=Grit_i+1; Grmt_i=Grmt_i+1;
        Ovt_i=Ovt_i+1;

        $display("Termination of Fan out update, time=%d", $time);
    end
end

always @(posedge Clock)
begin
    if (e23)
    begin
        e23=0;
        Num_tst_cnt=Num_tst_cnt-1;

        if (Num_tst_cnt==0)
        begin
            e24=1;
        end
        else
            Gate_eval_init_proc=1;
    end
end



always @(negedge Clock)
begin
    if (e24)
    begin
        $display("E24 attained, End of fanout update.");
        $display(" : ");

        Inval_unit_shft_enbl=1;
        Shft_cnt=Init_shft_val;
        e24=0;
        e25=1;
    end
end




// Input_val_bank is +ve edge triggered. Thus next block is -ve edge.

always @(posedge Clock)
begin
    if (e25)
    begin
        $display("E25 attained");
        Shft_ary_lb=1;
        e25=0;
        e26=1;
    end
end

always @(negedge Clock)
begin
    if (e26)
    begin
        $display("E26 attained");
        Shft_cnt=Shft_cnt-1;
        if (Shft_cnt==0)
        begin
            e26=0;
            Inval_unit_shft_enbl=0;
            e27=1;
        end
    end
end



always @(posedge Clock)
begin
    if (e27)
    begin
        Shft_ary_lb=0;
        e27=0;
        e28=1;
    end
end

always @(negedge Clock)
begin
    if (e28)
    begin
        e28=0;
        e29=1;
    end
end

always @(posedge Clock)
begin
    if (e29)
    begin
        Gate_eval_init_proc=1;
        Num_tst_cnt=Num_tst_seq;

        Ept_i=8'h00; Epvt_i=8'h00; Ntpt_i=8'h00;
        Gmlat_i=8'h00; Gilat_i=8'h00; Tpt_i=8'h00;
        Grit_i=8'h00; Grmt_i=8'h00; Ovt_i=8'h00;
        e29=0;
    end
end



endmodule 




Further simulations were carried out, again with a Verilog model of APPLES, which
simulated four ISCAS-85 benchmarks: C7552 (4392 gates), C2670 (1736 gates),
C1908 and C880, using a unit delay. Each was exercised with random input
vectors over a time period ranging from 1,000 to 10,000 machine cycles.
Statistics were gathered as the number of scan registers varied from 1 to 50.
The speedup relative to the number of scan registers is shown in Table 1.

         No. Scan Registers             No. Scan Registers
         1     15     30     50         1     15     30     50
C7552    1     12.5   19.9   24.3       1     13.6   24.3   29.6
C2670    1     9.7    13.8   15.9       1     12.5   20.0   25.1
C1908    1     8.4    10.8   11.8       1     11.8   17.3   20.9
C880     1     7.8    8.3    9.7        1     11.1   12.6   15.9
         Speedup                        Speedup (excl. fixed-size overheads)
         (a)                            (b)

Table 1. Speedup Performance of Benchmarks



Table (1.a) demonstrates that in general the speedup increases with the number of
scan registers. The fixed-size overheads of gate evaluation, shifting inputs, etc.
tend to penalise the performance of the smaller circuits with a large number of
registers. A more balanced analysis is obtained by factoring out all fixed time
overheads in the simulation results. This reflects the performance of realistic, large
circuits where the fixed overheads will be negligible compared to the scan time.
Table (1.b) details the results with this correction. As expected, this correction has
a lesser effect on the larger benchmark circuits.



         Av. No. Cycles/Gate Processed
         No. Scan Registers
         1       15     30    50
C7552    154.6   11.3   6.4   5.2
C2670    101.9   8.0    5.1   3.9
C1908    86.9    6.8    5.1   3.9
C880     49.9    4.9    4.2   3.6

Table 2. Average No. of machine cycles per gate processed

Taking the corrected simulated performance statistics, Table (2) displays the
average number of machine cycles expended to process a gate. The APPLES
system intrinsically detects only active gates; no futile updates or processing are
executed. The data takes into account the scan time between hits and the time to
update the fan-out lists. As more registers are introduced, the time between hits
reduces and the gate update rate increases. Clashes happen and active gates are
effectively queued in a fan-out/update pipeline. The speedup saturates when the
fan-out/update rate, governed by the size of the average fan-out list, equals the
rate at which they enter the pipeline.
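
As a rough cross-check of Tables 1(b) and 2, the following Python sketch divides the single-register cycles-per-gate figure by the figure for N scan registers. It assumes, which the text does not state explicitly, that the corrected speedup is approximately this ratio. The names CYCLES_PER_GATE and implied_speedup are illustrative only.

```python
# Cycles per gate processed, copied from Table 2
# (inner keys: number of scan registers).
CYCLES_PER_GATE = {
    "C7552": {1: 154.6, 15: 11.3, 30: 6.4, 50: 5.2},
    "C2670": {1: 101.9, 15: 8.0, 30: 5.1, 50: 3.9},
    "C1908": {1: 86.9, 15: 6.8, 30: 5.1, 50: 3.9},
    "C880": {1: 49.9, 15: 4.9, 30: 4.2, 50: 3.6},
}


def implied_speedup(circuit, registers):
    """Ratio of single-register cycles/gate to N-register cycles/gate.

    Assumption (not from the source): speedup is roughly this ratio.
    """
    row = CYCLES_PER_GATE[circuit]
    return row[1] / row[registers]


if __name__ == "__main__":
    for circuit in CYCLES_PER_GATE:
        ratios = [round(implied_speedup(circuit, n), 1) for n in (15, 30, 50)]
        print(circuit, ratios)
```

For C7552 the ratios come out close to the Table 1(b) entries; the other rows agree only approximately.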

The benchmark performance of the circuits also permits an assessment of the
validity of the theory for the speedup presented in Eqts (7) and (4). From the
speedup measurements in Table 1(b), the corresponding value for f_av was
calculated using Eqt (7). This value, representing the average fan-out update time in
machine cycles, should be constant regardless of the number of scan registers.
Furthermore, for the evaluated benchmarks the fan-out ranged from 0 to 3 gates
and the probability of a hit, Prob_hit, was found to be 0.01 ± 5%. Within one and a
half clock cycles it is possible to update 2 fan-out gates; therefore, depending on
the circuit, f_av should be in the range 0.5 to 1.5. The calculated values for f_av are
shown in Table 3.







         No. Scan Registers
         15     30     50     Av
C7552    0.41   0.35   0.88   0.55
C2670    0.52   0.79   1.26   0.86
C1908    0.77   1.21   1.32   1.10
C880     0.16   1.98   1.54   1.22
         f_av

Table 3. The Average Fan-out Update Time (in machine cycles) for the
Benchmarks







The values for f_av are in accord with the range expected for the fan-out of these
circuits. The fluctuations in value across a row for f_av, where it should be constant,
are possibly due to the relatively small number of samples and size of circuits,
where a small perturbation in the distribution of hits in the hit-list can significantly
affect the speedup figures. In the case of C880, a 10% drop in speedup can
effectively lead to a ten-fold increase in f_av.
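
The Av column of Table 3 and the row-to-row fluctuation discussed above can be checked with a short Python sketch. The values are copied from Table 3; the spread measure, (max - min) / mean, is an illustrative choice and does not come from the specification.

```python
# f_av values copied from Table 3, for 15, 30 and 50 scan registers.
F_AV = {
    "C7552": [0.41, 0.35, 0.88],
    "C2670": [0.52, 0.79, 1.26],
    "C1908": [0.77, 1.21, 1.32],
    "C880": [0.16, 1.98, 1.54],
}


def row_mean(values):
    """Arithmetic mean of one benchmark's f_av entries."""
    return sum(values) / len(values)


def relative_spread(values):
    """(max - min) / mean: a crude measure of how far a row is from constant."""
    return (max(values) - min(values)) / row_mean(values)


if __name__ == "__main__":
    for circuit, values in F_AV.items():
        print(f"{circuit}: mean={row_mean(values):.2f} "
              f"spread={relative_spread(values):.2f}")
```

Under this measure C880, the smallest circuit, shows the largest spread across its row, which is consistent with the sensitivity noted in the text.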

The APPLES architecture is designed to provide a fast and flexible mechanism for
logic simulation. The technique of applying test patterns to an associative memory
culminates in fixed-time gate processing and a flexible delay model. Multiple
scan registers provide an effective way of parallelising the fan-out updating
procedure. This mechanism eliminates the need for conventional parallel
techniques such as load balancing and deadlock avoidance or recovery.
Consequently, parallel overheads are reduced. As more scan registers are
introduced, the gate evaluation rate increases, ultimately being limited by the
average fan-out list size per gate and consequently the memory bandwidth of the
fan-out list memory.

The APPLES architecture incorporates an alternative timing strategy which obviates
the need for complex deadlock avoidance or recovery procedures and other
mechanisms normally part of an event-driven simulation. The present invention has an
overhead which is considerably less than that of conventional approaches and permits
gate evaluation to be activated in memory. The reduction in processing overheads is
manifest in improved speedup performance relative to other techniques.


The message passing mechanism inherent in the Chandy-Misra algorithms has been
replaced by a parallel scanning mechanism. This mechanism allows the
fan-out/update procedure to be parallelised. As clashes occur, gates are effectively
put into a waiting queue which fills up a fan-out/update pipeline. Consequently, as
the pipeline fills up (with the increasing number of scan registers), performance
increases. The speedup reaches a limit when the rate at which new gates enter the
queue equals the fan-out rate. Nevertheless, the speedup and the number of cycles per
gate processed are considerably better than with conventional approaches. The system
also allows a wide range of delay models.
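
The saturation behaviour described above can be illustrated with a toy bottleneck model in Python. This is a sketch under stated assumptions and is not the specification's Eqt (7): each scan register is assumed to find a hit every 1/p cycles on average, N registers are assumed to find hits N times as fast, and every hit is assumed to occupy the shared fan-out/update pipeline for a fixed number of cycles. The function names and the parameter values in the usage loop are hypothetical.

```python
def cycles_per_gate(num_scan_registers, hit_probability, fan_out_update_cycles):
    """Toy bottleneck model (an assumption, not the specification's Eqt (7)).

    Scanning delivers one hit every 1 / (p * N) cycles on average; each hit
    then needs fan_out_update_cycles in the fan-out/update pipeline. The
    slower of the two stages sets the cycles spent per processed gate.
    """
    scan_cycles = 1.0 / (hit_probability * num_scan_registers)
    return max(scan_cycles, fan_out_update_cycles)


def speedup(num_scan_registers, hit_probability, fan_out_update_cycles):
    """Speedup relative to a single scan register under the toy model."""
    base = cycles_per_gate(1, hit_probability, fan_out_update_cycles)
    return base / cycles_per_gate(
        num_scan_registers, hit_probability, fan_out_update_cycles
    )


if __name__ == "__main__":
    # Illustrative parameters only: hit probability 0.01, 1 cycle per update.
    for n in (1, 15, 50, 100, 200):
        print(n, round(speedup(n, 0.01, 1.0), 1))
```

In this simplified model the speedup grows with the number of scan registers until the scan stage is as fast as the update stage, after which adding registers gives no further gain. The measured figures in Table 1 grow more slowly than this idealised model suggests.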




The bit-pattern gate evaluation mechanism in APPLES facilitates the implementation
of simple and complex delay models as a series of parallel searches. Consequently,
the evaluation process is constant in time, being performed in memory. Effectively,
there is a one-to-one correspondence between gate and processor (the gate word
pairs). This fine-grain parallelism allows maximum parallelism in the gate evaluation
phase. Active gates are automatically identified and their fan-out lists updated through
scanning a hit-list. This scanning mechanism is analogous to communication
overhead in typical parallel processing architectures; however, this scanning is
itself amenable to parallelisation. Multiple scan registers reduce the overhead time
and enable the gate processing rate to be limited solely by the fan-out memory
bandwidth. A substantial speedup of the logic simulation is attained with the APPLES
architecture, resulting in a gate processing rate of a few machine cycles per gate.

In this specification, the terms "comprise", "comprises" and "comprising" are used
interchangeably with the terms "include", "includes" and "including", and are to be
afforded the widest possible interpretation and vice versa.



The invention is not limited to the embodiments hereinbefore described which may be
varied in both construction and detail.